Rolling Count on varying window sizes

seskaros · July 5, 2022, 8:29pm

I have a table with almost a million rows. I want to apply a rolling count of strings/"event" based on Date AND Category. So far, I was able to get the counts of events within each month and category using dplyr :

df %>% group_by(Date, Category) %>% mutate(n_distinct(event))

I want to be able to modify this code so I can perform a rolling count based on specified window. I do have a column called "Num_Months" and it is category-specific (ex: Category A would be 6 months, category B would be 3 months, Category C would be 12 months, etc). My Date column has dates ranging from as early as 2012.

I have looked into the slider package and "slide_index" seems like a good function for me to use as it's index specific (index would be my Date column) , but how would you include it in this code? I'm open to other options/solutions as well!

Here is a sample of my data:
df<- structure(list(Date = c("01-01-2020","01-01-2020", "01-01-2020", "01-01-2020","01-01-2020","01-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "03-01-2020", "03-01-2020", "04-01-2020", "04-01-2020", "05-01-2020", "05-01-2020"), Category = c("A", "A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "B", "C", "A", "C", "A", "B" ), Num_Months = c(6,6,6,3,12,12,6,3,3,3,12,3,12,6,12,6,3), Events= c("event1a", "event2a", "event1a", "event1b", "event1c", "event2c", "event3a", "event2b", "event3b", "event3b", "event3c", "event4b", "event4c", "event4a", "event5c", "event5a", "event 5b")), row.names = c(NA, 17L), class = "data.frame")

This is what I want the results to look like:
result <- structure(list(Date = c("01-01-2020","01-01-2020", "01-01-2020","02-01-2020", "02-01-2020", "02-01-2020", "03-01-2020", "03-01-2020","04-01-2020", "04-01-2020", "05-01-2020", "05-01-2020"), Category = c("A", "B", "C","A", "B", "C", "B", "C", "A", "C", "A", "B" ), rolling= c(2, 1,2,3,3,3,4,4,4, 5,5,4)), row.names = c(NA, 12L), class = "data.frame")

williaml · July 6, 2022, 1:25am

Hi, it might help if you included a reproducible example of your code (as in enough so that people can test out the code to get a solution)

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Dissipation123 · July 6, 2022, 5:44am

Hmm, you can summarize the data first then do a cumulative sum of what you want to count. In your example, some re-grouping would have to be done.

First the number of distinct events are summarized, then those are further added up by Category.

library(dplyr)

df %>% group_by(Date, Category) %>%
  summarize(count = n_distinct(Events)) %>% 
  ungroup() %>% 
  group_by(Category) %>% 
  mutate(cumulative_sum = cumsum(count))

seskaros · July 6, 2022, 11:13am

That is actually very close to what I want. Is there a way to specify the window size per category? For example, if it's category A, then perform a 6-month rolling sum; the sum of the current month plus n-1 months.

Dissipation123 · July 6, 2022, 12:33pm

I'm not quite able to visualize what sort of final dataset that you would need so it may be helpful to provide another example of how an input dataset would melt down to a results dataset, but you can introduce some more nuance into the mutate statement to specify different types of cumulative sums using a case_when statement.

Try to modify this sample code based on the different cumulative sums you need for each category:

library(dplyr)

df %>% group_by(Date, Category) %>%
  summarize(count = n_distinct(Events)) %>% 
  ungroup() %>% 
  group_by(Category) %>% 
  mutate(
    cumulative_sum = 
      case_when (
       Category == 'A' ~ 2 * cumsum(count),
       Category == 'B' ~ 4 * cumsum(count),
       TRUE ~ 6 * cumsum(count) #remaining cases
      )
    )

system · July 27, 2022, 12:33pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.