How to group a column based on conditional date rows

Khoajak · August 23, 2022, 4:23pm

Hello RStudio Community,
I have the following treatment data (Id, Startdate, Enddate, and trt). I would like to make a new combinedtrt column from trt, if for every Id, Startdate (current row) <Enddate (above row), then add the value of above row with the current one (e.g." B+C " for ID=1 and row 3). Please look at the final expected output and I highly appreciate your help in advance.
Thanks

# treatment data
data <- data.frame(Id = c(1L,	1L,	1L,	1L,	1L,	2L,	2L,	2L,	2L,	3L,	3L,	3L,	3L,	3L,	4L,	4L,	4L,	4L,	4L),
                   Startdate = c("2020-01-20",	"2020-02-20",	"2020-03-20",	"2020-04-20",	"2020-05-10",	"2020-01-20",	"2020-02-15",	"2020-03-20",	"2020-04-15",	"2020-05-10",	"2020-07-15",	"2020-08-15",	"2020-10-25",	"2020-11-10",	"2020-04-10",	"2020-04-10",	"2020-08-15",	"2020-10-25",	"2020-10-27"),
                   Enddate = c("2020-02-20",	"2020-03-25",	"2020-04-22",	"2020-05-15",	"2020-06-12",	"2020-02-20",	"2020-03-20",	"2020-04-22",	"2020-05-15",	"2020-06-12",	"2020-08-20",	"2020-09-22",	"2020-11-15",	"2021-01-12",	"2020-05-12",	"2020-08-20",	"2020-09-22",	"2020-11-15",	"2021-01-12"),
                   Trt = factor(c("A",	"B",	"C",	"A",	"D",	"A",	"B",	"C",	"D",	"D",	"B",	"C",	"C",	"D",	"D",	"B",	"C",	"C",	"D")),
                   stringsAsFactors = FALSE)

# expected output data
Id	Startdate	Enddate	Trt	Combinedtrt
1	2020-01-20	2020-02-20	A	
1	2020-02-20	2020-03-25	B	
1	2020-03-20	2020-04-22	C	B+C
1	2020-04-20	2020-05-15	A	C+A
1	2020-05-10	2020-06-12	D	A+D
2	2020-01-20	2020-02-20	A	
2	2020-02-15	2020-03-20	B	A+B
2	2020-03-20	2020-04-22	C	
2	2020-04-15	2020-05-15	D	C+D
3	2020-05-10	2020-06-12	D	
3	2020-07-15	2020-08-20	B	
3	2020-08-15	2020-09-22	C	B+C
3	2020-10-25	2020-11-15	C	
3	2020-11-10	2021-01-12	D	C+D
4	2020-04-10	2020-05-12	D	
4	2020-04-10	2020-08-20	B	D+B
4	2020-08-15	2020-09-22	C	B+C
4	2020-10-25	2020-11-15	C	
4	2020-10-27	2021-01-12	D	C+D

FactOREO · August 23, 2022, 5:43pm

Hello,

I think this does what you want:

library(data.table)
library(dplyr)

data |>
  group_by(Id) |>
  mutate(
    combined_treat = fcase(
      Startdate < shift(Enddate), paste(shift(Trt),Trt, sep = ' + '),
      default = NA_character_
      )
    )
#> # A tibble: 19 × 5
#> # Groups:   Id [4]
#>       Id Startdate  Enddate    Trt   combined_treat
#>    <int> <chr>      <chr>      <fct> <chr>         
#>  1     1 2020-01-20 2020-02-20 A     <NA>          
#>  2     1 2020-02-20 2020-03-25 B     <NA>          
#>  3     1 2020-03-20 2020-04-22 C     B + C         
#>  4     1 2020-04-20 2020-05-15 A     C + A         
#>  5     1 2020-05-10 2020-06-12 D     A + D         
#>  6     2 2020-01-20 2020-02-20 A     <NA>          
#>  7     2 2020-02-15 2020-03-20 B     A + B         
#>  8     2 2020-03-20 2020-04-22 C     <NA>          
#>  9     2 2020-04-15 2020-05-15 D     C + D         
#> 10     3 2020-05-10 2020-06-12 D     <NA>          
#> 11     3 2020-07-15 2020-08-20 B     <NA>          
#> 12     3 2020-08-15 2020-09-22 C     B + C         
#> 13     3 2020-10-25 2020-11-15 C     <NA>          
#> 14     3 2020-11-10 2021-01-12 D     C + D         
#> 15     4 2020-04-10 2020-05-12 D     <NA>          
#> 16     4 2020-04-10 2020-08-20 B     D + B         
#> 17     4 2020-08-15 2020-09-22 C     B + C         
#> 18     4 2020-10-25 2020-11-15 C     <NA>          
#> 19     4 2020-10-27 2021-01-12 D     C + D

^{Created on 2022-08-23 by the reprex package (v2.0.1)}

The NA_character_ is just so you have a defined value if the condition is not met. You could use whitespace or something else if you'd like, but I think NA makes it clear.

Kind regards

FactOREO · August 23, 2022, 7:30pm

Please accept the answer if it serves your needs. Thanks.

Khoajak · August 23, 2022, 8:23pm

FactOREO:

library(data.table)
library(dplyr)

data |>
  group_by(Id) |>
  mutate(
    combined_treat = fcase(
      Startdate < shift(Enddate), paste(shift(Trt),Trt, sep = ' + '),
      default = NA_character_
      )
    )

Thank you FactOREO. It is exactly what I wanted.

nedallo · August 24, 2022, 10:54am

Dear FactOREO

hope my message finds you well ,

just wondering why did you use group_by(Id) , if we didn't group it by Id i guess we would get the same result.

thank you

FactOREO · August 24, 2022, 11:11am

Hello @nedallo,

actually you would not (in any case). Imagine a situation, where the condition is fullfilled (current Startdate < previous Enddate), but with two different subjects (e.g. IDs). Since the condition would be met, a combined treat would be the output. But since the subjects are different, it would make no sense, e.g. how could it be meaningful to notice a combined treat (A + B) for a person (ID 2), caused by an overlap in the given dates between ID 1 and ID 2.

In this case there wasn't a "double treatment" for ID 2, but rather a (random) overlap in the treatment dates between two subjects.

Hence it is necessary to group by ID, to avoid such "useless" (or even wrong) informations.

Kind regards

nedallo · August 24, 2022, 5:09pm

thank you for your informative reply ,, appreciated
have a wonderful day

system · August 31, 2022, 5:09pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.