[dplyr/case_when/summarize] Conditional summary using `case_when` based on date windows

Hi Posit Community

Have a bit of an interesting case on my hand and nothing jumped out at me in the documentation.

I have a set of data not unlike the following example.

ID date visit type
000000 2022-02-21 2 Type_I
000000 2023-02-01 1 Type_I
000001 2023-03-18 1 Type_I
000001 2023-11-03 3 Type_II
000001 2022-01-31 2 Type_II
000002 2022-03-11 3 Type_I
000002 2022-09-04 4 Type_I
000002 2023-01-20 2 Type_I
000002 2022-02-14 1 Type_I
000003 2023-01-08 2 Type_I
000003 2023-10-12 3 Type_I
000003 2023-01-10 1 Type_I
000004 2023-12-21 2 Type_I
000004 2022-09-13 3 Type_I
000004 2023-09-28 1 Type_I
000004 2022-09-08 4 Type_I
000005 2022-05-12 3 Type_I
000005 2022-08-22 1 Type_I
000005 2022-06-20 2 Type_II
000006 2023-08-10 1 Type_I
df <- structure(list(ID = c("000000", "000000", "000001", "000001", 
"000001", "000002", "000002", "000002", "000002", "000003", "000003", 
"000003", "000004", "000004", "000004", "000004", "000005", "000005", 
"000005", "000006"), date = structure(c(19044, 19389, 19434, 
19664, 19023, 19062, 19239, 19377, 19037, 19365, 19642, 19367, 
19712, 19248, 19628, 19243, 19124, 19226, 19163, 19579), class = "Date"), 
    visit = c(2L, 1L, 1L, 3L, 2L, 3L, 4L, 2L, 1L, 2L, 3L, 1L, 
    2L, 3L, 1L, 4L, 3L, 1L, 2L, 1L), type = c("Type_I", "Type_I", 
    "Type_I", "Type_II", "Type_II", "Type_I", "Type_I", "Type_I", 
    "Type_I", "Type_I", "Type_I", "Type_I", "Type_I", "Type_I", 
    "Type_I", "Type_I", "Type_I", "Type_I", "Type_II", "Type_I"
    )), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))

Where I need to set a conditional boolean flag based on the criteria:

  • At least one instance of "Type_II"
  • Two or more cases of "Type_I" that occur within at least 90 days of one another

So that the output would look something like this:

ID Flag
000000 0
000001 1
000002 1
000003 1
000004 1
000005 1
000006 0

My first instinct here was to use pivot_wider and add_count by ID to check for entries by participant and set up conditional criteria for subsequent summaries:

df2 <- df |> pivot_wider(names_from = type, values_from = date) |> add_count(ID)

And flag !is.na(Type_II) and between(Type_I<date_y>,Type_I<date_x>,(Type_I<date_x>+days(90)) across any ascending order of dates to conditionally set the flags, but I'm having difficulty understanding how to succinctly combine summarize and case_when to do this.

I would appreciate any suggestions on how this might be accomplished within the tidyverse framework or otherwise.

Thank you in advance!

Hi @TPDeRamus,

Here I do two processing steps - a mutate()to summarise per record for each ID, then a summarise() to see if any record per ID meets either one of the criteria:

library(tidyverse)
df <- structure(list(ID = c("000000", "000000", "000001", "000001", 
                            "000001", "000002", "000002", "000002", "000002", "000003", "000003", 
                            "000003", "000004", "000004", "000004", "000004", "000005", "000005", 
                            "000005", "000006"), date = structure(c(19044, 19389, 19434, 
                                                                    19664, 19023, 19062, 19239, 19377, 19037, 19365, 19642, 19367, 
                                                                    19712, 19248, 19628, 19243, 19124, 19226, 19163, 19579), class = "Date"), 
                     visit = c(2L, 1L, 1L, 3L, 2L, 3L, 4L, 2L, 1L, 2L, 3L, 1L, 
                               2L, 3L, 1L, 4L, 3L, 1L, 2L, 1L), type = c("Type_I", "Type_I", 
                                                                         "Type_I", "Type_II", "Type_II", "Type_I", "Type_I", "Type_I", 
                                                                         "Type_I", "Type_I", "Type_I", "Type_I", "Type_I", "Type_I", 
                                                                         "Type_I", "Type_I", "Type_I", "Type_I", "Type_II", "Type_I"
                               )), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
                               ))

df2 <- 
  df |> 
  arrange(ID, date, visit) |> 
  # work out individual flags
  mutate(
    # flag type_ii records
    flag_t2 = case_match(type, 'Type_II' ~ T, .default = F),
    
    # flag type_i records
    flag_t1 = case_match(type, 'Type_I' ~ T, .default = F),
    
    # flag records within 90 days of previous one
    flag_90d = date - lag(date, n = 1) <= 90,
    
    # calculate overall flag for record
    Flag = case_when(
      flag_t2 ~ T,
      flag_t1 & flag_90d ~ T,
      .default = F
    ),
    
    .by = ID
  ) |> 
  # summarise per ID
  summarise(
    Flag = max(Flag),
    .by = ID
  )

# show result
df2
#> # A tibble: 7 × 2
#>   ID      Flag
#>   <chr>  <int>
#> 1 000000     0
#> 2 000001     1
#> 3 000002     1
#> 4 000003     1
#> 5 000004     1
#> 6 000005     1
#> 7 000006     0

Created on 2024-09-11 with reprex v2.1.0

1 Like

@craig.parylo

Thank you!

This is an excellent solution to an issue that's been bugging me for some time.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.