Grouping dates together into periods

kentm · May 7, 2020, 7:29pm

I'm trying to solve a problem of grouping dates together based on certain length of time from a starting date. For example, given the following data:

library(dplyr)

test_df <- tibble::tribble(
	~id, ~date,
	1,'01/01/2011',
	2,'01/05/2011',
	3,'01/10/2011',
	4,'02/01/2011',
	5,'02/10/2011',
	6,'02/11/2011',
	7,'02/19/2011',
	8,'03/01/2011',
	9,'03/02/2011',
	10,'03/03/2011',
	11,'03/04/2011',
	12,'04/29/2011',
	13,'05/01/2011',
	14,'05/02/2011',
	15,'05/29/2011',
	16,'05/30/2011',
	17,'06/01/2011',
	18,'06/02/2011',
	19,'06/03/2011',
	20,'06/10/2011'
) %>%
	mutate(date = lubridate::mdy(date))

the goal would be to group all dates within 28 days (this could be varied) from the minimum date. That is, at first pass, the starting date would be 01/01/2011 and we would group all dates 28 days out. Then, we would start over at the next non-grouped date which would be 02/01/2011 and so on until all the dates were grouped.

I realize this can be solved pretty easy with a for loop, but I'm wondering if anyone can think of any simple / creative solutions to solve this in a more tidyverse / functional way.

Thanks!

andresrcs · May 7, 2020, 11:53pm

Is this what you mean?

library(tidyverse)

tibble::tribble(
    ~id, ~date,
    1,'01/01/2011',
    2,'01/05/2011',
    3,'01/10/2011',
    4,'02/01/2011',
    5,'02/10/2011',
    6,'02/11/2011',
    7,'02/19/2011',
    8,'03/01/2011',
    9,'03/02/2011',
    10,'03/03/2011',
    11,'03/04/2011',
    12,'04/29/2011',
    13,'05/01/2011',
    14,'05/02/2011',
    15,'05/29/2011',
    16,'05/30/2011',
    17,'06/01/2011',
    18,'06/02/2011',
    19,'06/03/2011',
    20,'06/10/2011'
) %>%
    mutate(date = lubridate::mdy(date),
           group = cut(date, "28 days"))
#> # A tibble: 20 x 3
#>       id date       group     
#>    <dbl> <date>     <fct>     
#>  1     1 2011-01-01 2011-01-01
#>  2     2 2011-01-05 2011-01-01
#>  3     3 2011-01-10 2011-01-01
#>  4     4 2011-02-01 2011-01-29
#>  5     5 2011-02-10 2011-01-29
#>  6     6 2011-02-11 2011-01-29
#>  7     7 2011-02-19 2011-01-29
#>  8     8 2011-03-01 2011-02-26
#>  9     9 2011-03-02 2011-02-26
#> 10    10 2011-03-03 2011-02-26
#> 11    11 2011-03-04 2011-02-26
#> 12    12 2011-04-29 2011-04-23
#> 13    13 2011-05-01 2011-04-23
#> 14    14 2011-05-02 2011-04-23
#> 15    15 2011-05-29 2011-05-21
#> 16    16 2011-05-30 2011-05-21
#> 17    17 2011-06-01 2011-05-21
#> 18    18 2011-06-02 2011-05-21
#> 19    19 2011-06-03 2011-05-21
#> 20    20 2011-06-10 2011-05-21

^{Created on 2020-05-07 by the reprex package (v0.3.0)}

kentm · May 8, 2020, 12:24am

Thanks for the reply, I'm actually looking to create the groups based on all dates that fall within 28 days after the start date. This could either be a number (i.e. 1, 2, 3) or the minimum date of that grouping. The expected output would look like this:

tibble::tribble(
  ~id, ~date, ~date_group
  1,'01/01/2011','01/01/2011',
  2,'01/05/2011','01/01/2011',
  3,'01/10/2011','01/01/2011',
  4,'02/01/2011','02/01/2011',
  5,'02/10/2011','02/01/2011',
  6,'02/11/2011','02/01/2011',
  7,'02/19/2011','02/01/2011',
  8,'03/01/2011','02/01/2011',
  9,'03/02/2011','03/02/2011',
  10,'03/03/2011','03/02/2011',
  11,'03/04/2011','03/02/2011',
  12,'04/29/2011','04/29/2011',
  13,'05/01/2011','04/29/2011',
  14,'05/02/2011','04/29/2011',
  15,'05/29/2011','05/29/2011',
  16,'05/30/2011','05/29/2011',
  17,'06/01/2011','05/29/2011',
  18,'06/02/2011','05/29/2011',
  19,'06/03/2011','05/29/2011',
  20,'06/10/2011','05/29/2011'
)

We start at 01/01/2011 + 28 days would group all dates between 01/01/2011 and 01/29/2011, then we go to 02/01/2011 + 28 days and group all the dates in that range (02/01/2011 to 03/01/2011), then 03/02/2011 + 28 days and so on. Basically starting with the minimum date + 28 days and re-indexing on the next minimum date that does not fall in the prior 28 day period.

Ajackson · May 8, 2020, 1:09pm

The "incidence" package will do exactly what you want, I think. I use it to convert my migraine numbers to a 28-day month basis.

kentm · May 8, 2020, 1:32pm

Thanks! Looks like this doesn't quite get at it. Doing something like:

incidence(dates, interval = "28 days")

returns 28 day intervals that start from the very first date. However, after the first 28 days, the starting date needs to be the next available minimum date. For example, if we had dates 01/01/2011, 01/02/2011 and 01/01/2013 -- I wouldn't want 28 day intervals between 2011 - 2013, I would want the first 28 day interval to start on 01/01/2011 and then the next 28 day interval would start on 01/01/2013.

kentm · May 8, 2020, 1:35pm

For example, this code with a for loop works but it gets pretty slow as the data frame grows (I left the grouping ID variable in there for testing):

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

date_df <- tibble::tribble(
    ~id, ~date,
    1,'01/01/2011',
    1,'01/05/2011',
    1,'01/10/2011',
    1,'02/01/2011',
    1,'02/10/2011',
    1,'02/11/2011',
    1,'02/19/2011',
    1,'03/01/2011',
    1,'03/02/2011',
    1,'03/03/2011',
    1,'03/04/2011',
    2,'04/29/2011',
    2,'05/01/2011',
    2,'05/02/2011',
    2,'05/29/2011',
    3,'05/30/2011',
    3,'06/01/2011',
    3,'06/02/2011',
    3,'06/03/2011',
    3,'06/10/2011'
) %>%
    mutate(date = lubridate::mdy(date))

create_date_periods <- function(dates, time_period = 28) {
    
    # create a vector to hold the results
    return_vector <- structure(integer(length(dates)), class = "Date")
    
    for(i in seq_along(dates)) {
        # if this is the first record, set to minimum
        if (i == 1) min_date <- dates[i]
        
        # if date less than minimum date + time period, use minimum date
        if(dates[i] <= min_date + time_period) {
            
            return_vector[i] <- min_date
            
            # otherwise update minimum date to current date
        } else {
            min_date <- dates[i]
            return_vector[i] <- min_date
        }}
    
    return(return_vector)
}

# testing
date_df %>%
    # group_by(id) %>%
    mutate(
        period_start = create_date_periods(date, time_period = 28)
    )
#> # A tibble: 20 x 3
#>       id date       period_start
#>    <dbl> <date>     <date>      
#>  1     1 2011-01-01 2011-01-01  
#>  2     1 2011-01-05 2011-01-01  
#>  3     1 2011-01-10 2011-01-01  
#>  4     1 2011-02-01 2011-02-01  
#>  5     1 2011-02-10 2011-02-01  
#>  6     1 2011-02-11 2011-02-01  
#>  7     1 2011-02-19 2011-02-01  
#>  8     1 2011-03-01 2011-02-01  
#>  9     1 2011-03-02 2011-03-02  
#> 10     1 2011-03-03 2011-03-02  
#> 11     1 2011-03-04 2011-03-02  
#> 12     2 2011-04-29 2011-04-29  
#> 13     2 2011-05-01 2011-04-29  
#> 14     2 2011-05-02 2011-04-29  
#> 15     2 2011-05-29 2011-05-29  
#> 16     3 2011-05-30 2011-05-29  
#> 17     3 2011-06-01 2011-05-29  
#> 18     3 2011-06-02 2011-05-29  
#> 19     3 2011-06-03 2011-05-29  
#> 20     3 2011-06-10 2011-05-29

^{Created on 2020-05-08 by the reprex package (v0.3.0)}

andresrcs · May 8, 2020, 3:35pm

This is a little hacky but still simpler than a loop

library(tidyverse)

tibble::tribble(
    ~id, ~date,
    1,'01/01/2011',
    2,'01/05/2011',
    3,'01/10/2011',
    4,'02/01/2011',
    5,'02/10/2011',
    6,'02/11/2011',
    7,'02/19/2011',
    8,'03/01/2011',
    9,'03/02/2011',
    10,'03/03/2011',
    11,'03/04/2011',
    12,'04/29/2011',
    13,'05/01/2011',
    14,'05/02/2011',
    15,'05/29/2011',
    16,'05/30/2011',
    17,'06/01/2011',
    18,'06/02/2011',
    19,'06/03/2011',
    20,'06/10/2011'
) %>%
    mutate(date = lubridate::mdy(date),
           group = cut(date, "27 days")) %>% 
    group_by(group) %>% 
    mutate(date_group = min(date)) %>%
    ungroup() %>% 
    select(-group)
#> # A tibble: 20 x 3
#>       id date       date_group
#>    <dbl> <date>     <date>    
#>  1     1 2011-01-01 2011-01-01
#>  2     2 2011-01-05 2011-01-01
#>  3     3 2011-01-10 2011-01-01
#>  4     4 2011-02-01 2011-02-01
#>  5     5 2011-02-10 2011-02-01
#>  6     6 2011-02-11 2011-02-01
#>  7     7 2011-02-19 2011-02-01
#>  8     8 2011-03-01 2011-03-01
#>  9     9 2011-03-02 2011-03-01
#> 10    10 2011-03-03 2011-03-01
#> 11    11 2011-03-04 2011-03-01
#> 12    12 2011-04-29 2011-04-29
#> 13    13 2011-05-01 2011-04-29
#> 14    14 2011-05-02 2011-04-29
#> 15    15 2011-05-29 2011-05-29
#> 16    16 2011-05-30 2011-05-29
#> 17    17 2011-06-01 2011-05-29
#> 18    18 2011-06-02 2011-05-29
#> 19    19 2011-06-03 2011-05-29
#> 20    20 2011-06-10 2011-05-29

^{Created on 2020-05-08 by the reprex package (v0.3.0)}

kentm · May 8, 2020, 3:53pm

Thanks! That's close but it still misclassified some of the dates, which I think is due to the cut(date, "27 days") starting relative to the first date and not the minimum of the next period. The 2011-03-01 date should be grouped with 2011-02-01 since it is 28 days after 2011-02-01. However, when using cut, it belongs to a different group since the first three groups become:

2011-01-01
2011-01-29
2011-02-26

all of these are 28 days relative to the very first date of 2011-01-01

kentm · May 8, 2020, 7:07pm

Using a while loop is faster and more vectorized than a for loop but it doesn't feel right putting a while loop inside a function:

create_date_periods <- function(dates, time_period = 28) {
 # TODO: add some error checking

 # create a vector to hold the results
 return_vector <- structure(rep(NA_real_, length(dates)), class = "Date")
 
 # if any date in the vector is still missing, keep going
 while(any(is.na(return_vector))) {
  
  # set minimum date amongst the values that are missing
   min_date <- min(dates[is.na(return_vector)])
  
  # if the date falls in range of interest, set it to the minimum date
  return_vector[dates >= min_date & dates <= min_date + time_period] <- min_date
 }
 
 return(return_vector)
}

Ajackson · May 9, 2020, 12:25pm

To do the 28 day grouping on an annualized basis, create a "year" variable, and then do incidence inside a group_by(year).

system · May 30, 2020, 12:25pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.