Converting/merging rows in a dataset depending on dates in next row(s)

Nice · April 7, 2023, 2:31pm

I have a very big dataset including data for 130.000 persons (ID). Each person has been using a medicine in some intervals, sometimes with unintended periods of pauses (dose=0). However, when the pause was intended by physician, there is no row of dose=0, but the next row simply starts with a new medication-period.

What I want to do is: when a medication-period was intended, i.e. followed by one or more medication-periods with no rows with pauses (dose=0) inbetween, then these rows should be changed to 1 row starting with "firstday" of the first medication-period and ending with the "lastday" of the last medication-period. Otherwise, if the rows are separated by a row with "dose=0", then this mean the pause was unintended and nothing should be changed in this row.

Here is an example:

test_df <- structure(list(ID = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c"), dose = c(20, 0, 20, 0, 20, 20, 5, 5, 5, 5, 5, 5, 5, 0, 5, 5), firstday= c("2016-04-28", "2016-08-08","2018-01-20","2018-04-30","2019-12-23","2020-04-02", "2019-01-04", "2019-01-30", "2019-02-26", "2019-03-12", "2019-04-11", "2019-01-31", "2019-03-05", "2019-03-12", "2019-04-23", "2019-05-27"), lastday= c("2016-08-08", "2018-01-20","2018-04-30","2019-12-23","2020-04-01","2020-07-11", "2019-01-11", "2019-02-06", "2019-03-05", "2019-03-19", "2019-04-18", "2019-02-14", "2019-03-12", "2019-04-23", "2019-04-30", "2019-06-03")), row.names = c(NA, -16L), class = "data.frame", .Names = c("ID", "dose", "firstday", "lastday"))

As you can see,"a" had 2 medication-periods after each other with no "dose=0", which mean that row 5 and 6 should be changed to 1 row containing "firstday" from row 5 and "lastday" from row 6. Rows 7 to 11 (fro ID=b) should be converted to only 1 row, containing "firstday" from row 7 and "lastday" from row 11. And rows 15 and 16 (for ID=c) should be converted to 1 row containing "firstday" from row 15 and "lastday" from row 16.

I have tried to do it stepwise by first generating a new column including the dates I want and then deleting excess columns:

tryied < test_df %>% group_by(ID) %>% mutate(new_column = ifelse(dose[row_number() + 1] , lastday, 0), new_column = c(new_column[-n()] , 0))

Howevever, this only works if a medication-period is followed by 1 other medication-period and not if there are several medication-periods following each other, and is not satisfiying at all...

scottyd22 · April 7, 2023, 9:28pm

A more straight-forward approach probably exists, but below is one way to achieve the desired outcome, which uses the lead() and lag() functions.

library(tidyverse)

out = test_df |>
  group_by(ID) |>
  # identify where a consecutive dose group starts
  mutate(group_start = case_when(
    dose == 0 ~ 0,
    lag(dose) == dose ~ 0,
    lead(dose) == dose ~ 1,
    TRUE ~ 0
    )) |>
  # generate a group number for the consecutive group
  mutate(group_num = case_when(
    dose == 0 ~ 0,
    lead(dose) == dose | lag(dose) == dose ~ cumsum(group_start),
    TRUE ~ 0
  )) |>
  ungroup() |>
  # get the start and end dates for each group num
  group_by(ID, group_num) |>
  mutate(firstday = case_when(
    group_num == 0 ~ firstday,
    TRUE ~ min(firstday)
  )) |>
  mutate(lastday = case_when(
    group_num == 0 ~ lastday,
    TRUE ~ max(lastday)
  )) |>
  ungroup() |>
  # remove the temporary working columns
  select(-group_start, -group_num) |>
  distinct()

out
#> # A tibble: 9 × 4
#>   ID     dose firstday   lastday   
#>   <chr> <dbl> <chr>      <chr>     
#> 1 a        20 2016-04-28 2016-08-08
#> 2 a         0 2016-08-08 2018-01-20
#> 3 a        20 2018-01-20 2018-04-30
#> 4 a         0 2018-04-30 2019-12-23
#> 5 a        20 2019-12-23 2020-07-11
#> 6 b         5 2019-01-04 2019-04-18
#> 7 c         5 2019-01-31 2019-03-12
#> 8 c         0 2019-03-12 2019-04-23
#> 9 c         5 2019-04-23 2019-06-03

Created on 2023-04-07 with reprex v2.0.2

Nice · April 8, 2023, 12:05pm

This is perfect! It works.
Thank you very very much!

system · May 20, 2023, 12:06pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.