Rolling Count on varying window sizes

I have a table with almost a million rows. I want to apply a rolling count of strings/"event" based on Date AND Category. So far, I was able to get the counts of events within each month and category using dplyr :

df %>% group_by(Date, Category) %>% mutate(n_distinct(event))

I want to be able to modify this code so I can perform a rolling count based on specified window. I do have a column called "Num_Months" and it is category-specific (ex: Category A would be 6 months, category B would be 3 months, Category C would be 12 months, etc). My Date column has dates ranging from as early as 2012.

I have looked into the slider package and "slide_index" seems like a good function for me to use as it's index specific (index would be my Date column) , but how would you include it in this code? I'm open to other options/solutions as well!

Here is a sample of my data:
df<- structure(list(Date = c("01-01-2020","01-01-2020", "01-01-2020", "01-01-2020","01-01-2020","01-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "02-01-2020", "03-01-2020", "03-01-2020", "04-01-2020", "04-01-2020", "05-01-2020", "05-01-2020"), Category = c("A", "A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "B", "C", "A", "C", "A", "B" ), Num_Months = c(6,6,6,3,12,12,6,3,3,3,12,3,12,6,12,6,3), Events= c("event1a", "event2a", "event1a", "event1b", "event1c", "event2c", "event3a", "event2b", "event3b", "event3b", "event3c", "event4b", "event4c", "event4a", "event5c", "event5a", "event 5b")), row.names = c(NA, 17L), class = "data.frame")

This is what I want the results to look like:
result <- structure(list(Date = c("01-01-2020","01-01-2020", "01-01-2020","02-01-2020", "02-01-2020", "02-01-2020", "03-01-2020", "03-01-2020","04-01-2020", "04-01-2020", "05-01-2020", "05-01-2020"), Category = c("A", "B", "C","A", "B", "C", "B", "C", "A", "C", "A", "B" ), rolling= c(2, 1,2,3,3,3,4,4,4, 5,5,4)), row.names = c(NA, 12L), class = "data.frame")

Hi, it might help if you included a reproducible example of your code (as in enough so that people can test out the code to get a solution)

Hmm, you can summarize the data first then do a cumulative sum of what you want to count. In your example, some re-grouping would have to be done.

First the number of distinct events are summarized, then those are further added up by Category.

library(dplyr)

df %>% group_by(Date, Category) %>%
  summarize(count = n_distinct(Events)) %>% 
  ungroup() %>% 
  group_by(Category) %>% 
  mutate(cumulative_sum = cumsum(count))
  

That is actually very close to what I want. Is there a way to specify the window size per category? For example, if it's category A, then perform a 6-month rolling sum; the sum of the current month plus n-1 months.

I'm not quite able to visualize what sort of final dataset that you would need so it may be helpful to provide another example of how an input dataset would melt down to a results dataset, but you can introduce some more nuance into the mutate statement to specify different types of cumulative sums using a case_when statement.

Try to modify this sample code based on the different cumulative sums you need for each category:

library(dplyr)

df %>% group_by(Date, Category) %>%
  summarize(count = n_distinct(Events)) %>% 
  ungroup() %>% 
  group_by(Category) %>% 
  mutate(
    cumulative_sum = 
      case_when (
       Category == 'A' ~ 2 * cumsum(count),
       Category == 'B' ~ 4 * cumsum(count),
       TRUE ~ 6 * cumsum(count) #remaining cases
      )
    )

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.