How can I use R (Base or Tidyverse) to flag each patient IDs last non-missing screening record as baseline?

dstokar · March 4, 2020, 12:39am

Here is some sample data

df <- tibble::tribble(
~subjid, ~Date, ~SBP, ~Visit_Type,
1, "15-Jan-19", 125, "Screening",
1, "16-Jan-19", 130, "Screening",
1, "17-Jan-19", 127, NA,
1, "18-Jan-19", 120, NA,
2, "9-Jan-19", 145, "Screening",
2, "10-Jan-19", 130, "Screening",
2, "11-Jan-19", 140, NA,
2, "12-Jan-19", 120, NA,
3, "10-Feb-19", 145, "Screening",
3, "12-Feb-19", NA, "Screening",
3, "13-Feb-19", 140, NA,
3, "15-Feb-19", 120, NA
)

None of the answers so far is correct. I want the last non missing record that is a screening record (and then call that baseline) as below:
1 16-Jan-19 130 Screening Flag
2 10-Jan-19 130 Screening Flag
3 10-Feb-19 145 Screening Flag

williaml · March 4, 2020, 4:27am

First of all, you should make sure that you have a reproducible example to make it easier for us trying to answer your question:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

But, is this what you are after?

library(tidyverse)

df <- tibble::tribble(
  ~subjid,       ~Date, ~SBP, ~Visit_Type,
        1, "15-Jan-19",  125, "Screening",
        1, "16-Jan-19",  130, "Screening",
        1, "17-Jan-19",  127,          NA,
        1, "18-Jan-19",  120,          NA,
        2,  "9-Jan-19",  145, "Screening",
        2, "10-Jan-19",  130, "Screening",
        2, "11-Jan-19",  140,          NA,
        2, "12-Jan-19",  120,          NA,
        3, "10-Feb-19",  145, "Screening",
        3, "12-Feb-19",   NA, "Screening",
        3, "13-Feb-19",  140,          NA,
        3, "15-Feb-19",  120,          NA
  )

df

df2 <- df %>% 
  group_by(subjid) %>% 
  filter(Visit_Type == "Screening") %>% 
  slice(-1) %>% 
  mutate(Baseline_flag = TRUE) %>% 
  select(-SBP, -Visit_Type)


df3 <- left_join(df, df2, by = c("subjid", "Date"))

> df3
# A tibble: 12 x 5
   subjid Date        SBP Visit_Type Baseline_flag
    <dbl> <chr>     <dbl> <chr>      <lgl>        
 1      1 15-Jan-19   125 Screening  NA           
 2      1 16-Jan-19   130 Screening  TRUE         
 3      1 17-Jan-19   127 NA         NA           
 4      1 18-Jan-19   120 NA         NA           
 5      2 9-Jan-19    145 Screening  NA           
 6      2 10-Jan-19   130 Screening  TRUE         
 7      2 11-Jan-19   140 NA         NA           
 8      2 12-Jan-19   120 NA         NA           
 9      3 10-Feb-19   145 Screening  NA           
10      3 12-Feb-19    NA Screening  TRUE         
11      3 13-Feb-19   140 NA         NA           
12      3 15-Feb-19   120 NA         NA

DavoWW · March 4, 2020, 4:49am

I think the OP probably wants to also exclude instances where SBP is NA. In addition, to be on the safe side, sort the dates first:

df2 <- df %>% 
  group_by(subjid) %>% 
  filter(Visit_Type == "Screening" & !is.na(SBP)) %>%
  arrange(desc(Date), .by_group=TRUE) %>% 
  slice(1) %>% 
  mutate(Baseline_flag = TRUE) %>% 
  select(-SBP, -Visit_Type)

df2
df3 <- left_join(dat, df2, by = c("subjid", "Date"))
df3

HTH

vinaychuri · March 4, 2020, 4:53am

Assuming the OP wants to flag the Baseline_Flag based on subjid, Date and last entry of Visit_Type.

> library(tidyverse)
> df %>%
>   mutate(Date = lubridate::dmy(Date)) %>%
>   group_by(subjid) %>%
>   arrange(subjid, Date, Visit_Type) %>%
>   mutate(Baseline_Flag = ifelse(is.na(Visit_Type), "Remove", "Keep")) %>%
>   arrange(subjid, Baseline_Flag) %>%
>   mutate(Baseline_Flag = ifelse(Baseline_Flag=="Remove","Remove",
>                                  ifelse(Baseline_Flag==lag(Baseline_Flag), "Keep", "Remove"))) %>%
>   mutate(Baseline_Flag=ifelse(is.na(Baseline_Flag),"Remove",Baseline_Flag))%>%
>   arrange(subjid, Date, Visit_Type) %>%
>   ungroup()

DavoWW · March 4, 2020, 5:26am

As an aside, I tried to solve this problem using a case_when construct but couldn't get it to work. Anyone have a solution?

df %>% 
  group_by(subjid) %>%
  arrange(desc(Date), .by_group=TRUE) %>%
  mutate(Baseline_Flag = case_when(
    !is.na(SBP) & !is.na(Visit_Type) & Date == top_n(1, Date) ~ TRUE,
    FALSE ~ FALSE))

joels · March 4, 2020, 5:28am

Using williaml's df data frame, we identify the last screening as follows: First, sort the dates to put the most recent first. Then, by subjid we use the duplicated function to find the first row where Visit_Type is not missing (!duplicated means "Not duplicated", which returns TRUE only for the first instance). The result is a new column that is TRUE for the last screening or FALSE otherwise.

library(tidyverse)

# Convert Date column to Date format
df = df %>% mutate(Date = lubridate::dmy(Date))

df = df %>% 
  arrange(desc(Date)) %>% 
  group_by(subjid) %>% 
  mutate(last.screening = !is.na(Visit_Type) & !duplicated(Visit_Type)) %>% 
  arrange(subjid, Date)

df
#>    subjid Date         SBP Visit_Type last.screening
#>     <dbl> <date>     <dbl> <chr>      <lgl>         
#>  1      1 2019-01-15   125 Screening  FALSE         
#>  2      1 2019-01-16   130 Screening  TRUE          
#>  3      1 2019-01-17   127 <NA>       FALSE         
#>  4      1 2019-01-18   120 <NA>       FALSE         
#>  5      2 2019-01-09   145 Screening  FALSE         
#>  6      2 2019-01-10   130 Screening  TRUE          
#>  7      2 2019-01-11   140 <NA>       FALSE         
#>  8      2 2019-01-12   120 <NA>       FALSE         
#>  9      3 2019-02-10   145 Screening  FALSE         
#> 10      3 2019-02-12    NA Screening  TRUE          
#> 11      3 2019-02-13   140 <NA>       FALSE         
#> 12      3 2019-02-15   120 <NA>       FALSE

# Filter to keep only last screening for each subject
df %>% filter(last.screening)
#>   subjid Date         SBP Visit_Type last.screening
#>    <dbl> <date>     <dbl> <chr>      <lgl>         
#> 1      1 2019-01-16   130 Screening  TRUE          
#> 2      2 2019-01-10   130 Screening  TRUE          
#> 3      3 2019-02-12    NA Screening  TRUE

dstokar · March 5, 2020, 1:56pm

These answers are not correct as they are not choosing the last non-missing 'Screening' record as baseline (TRUE). For Patient 3 - I want the baseline to be Feb 10 because it is not missing ( Feb 12 - the last screening record - has NA).

Can anyone advise on how to do this correctly?

siddharthprabhu · March 5, 2020, 2:53pm

Just a slight modification to @DavoWW's answer should do the trick.

library(tidyverse)

df <- tibble::tribble(
  ~subjid,       ~Date, ~SBP, ~Visit_Type,
  1, "15-Jan-19",  125, "Screening",
  1, "16-Jan-19",  130, "Screening",
  1, "17-Jan-19",  127,          NA,
  1, "18-Jan-19",  120,          NA,
  2,  "9-Jan-19",  145, "Screening",
  2, "10-Jan-19",  130, "Screening",
  2, "11-Jan-19",  140,          NA,
  2, "12-Jan-19",  120,          NA,
  3, "10-Feb-19",  145, "Screening",
  3, "12-Feb-19",   NA, "Screening",
  3, "13-Feb-19",  140,          NA,
  3, "15-Feb-19",  120,          NA
)

df2 <- df %>% 
  group_by(subjid) %>% 
  filter(Visit_Type == "Screening" & !is.na(SBP)) %>% 
  slice(n()) %>% 
  mutate(Baseline_flag = TRUE) %>% 
  select(-SBP, -Visit_Type)


df3 <- left_join(df, df2, by = c("subjid", "Date"))

print(df3)
#> # A tibble: 12 x 5
#>    subjid Date        SBP Visit_Type Baseline_flag
#>     <dbl> <chr>     <dbl> <chr>      <lgl>        
#>  1      1 15-Jan-19   125 Screening  NA           
#>  2      1 16-Jan-19   130 Screening  TRUE         
#>  3      1 17-Jan-19   127 <NA>       NA           
#>  4      1 18-Jan-19   120 <NA>       NA           
#>  5      2 9-Jan-19    145 Screening  NA           
#>  6      2 10-Jan-19   130 Screening  TRUE         
#>  7      2 11-Jan-19   140 <NA>       NA           
#>  8      2 12-Jan-19   120 <NA>       NA           
#>  9      3 10-Feb-19   145 Screening  TRUE         
#> 10      3 12-Feb-19    NA Screening  NA           
#> 11      3 13-Feb-19   140 <NA>       NA           
#> 12      3 15-Feb-19   120 <NA>       NA

^{Created on 2020-03-05 by the reprex package (v0.3.0)}

dstokar · March 5, 2020, 9:39pm

Hi , Your code works on this specific case but it’s not general enough for my use. I changed my example below to make it more realistic.
I made 3 cases: 1 where the first screening record is missing and the other 2 with the 2nd and 3rd missing. Again I want the last non-missing screening record to be kept or better - assigned as Baseline.

library(tidyverse)

df <- tibble::tribble(
~subjid, ~Date, ~SBP, ~Visit_Type,
1, "14-Jan-19", NA, "Screening",
1, "15-Jan-19", 125, "Screening",
1, "16-Jan-19", 130, "Screening",
1, "17-Jan-19", 127, NA,
1, "18-Jan-19", 120, NA,
2, "9-Jan-19", 145, "Screening",
2, "10-Jan-19", NA , "Screening",
2, "11-Jan-19", 130, "Screening",
2, "12-Jan-19", 140, NA,
2, "13-Jan-19", 120, NA,
3, "09-Feb-19", 135, "Screening",
3, "10-Feb-19", 145, "Screening",
3, "12-Feb-19", NA, "Screening",
3, "13-Feb-19", 140, NA,
3, "15-Feb-19", 120, NA
)

joels · March 5, 2020, 9:56pm

Can you say what specifically is not general enough about siddharthprabhu's answer? If the Date column is converted to Date class, it works properly on both your original and revised examples, as far as I can tell.

Here's an approach that will keep all rows and mark the baseline rows:

library(tidyverse) 

df %>% 
  mutate(Date = lubridate::dmy(Date),
         Visit_Type = fct_explicit_na(factor(Visit_Type))) %>% 
  arrange(subjid, Date) %>% 
  group_by(subjid) %>% 
  mutate(baseline.flag = case_when(Date==max(Date[Visit_Type=="Screening" & !is.na(SBP)]) ~ "Baseline",
                                   TRUE ~ "Not baseline"))

   subjid Date         SBP Visit_Type baseline.flag
    <dbl> <date>     <dbl> <fct>      <chr>        
 1      1 2019-01-14    NA Screening  Not baseline 
 2      1 2019-01-15   125 Screening  Not baseline 
 3      1 2019-01-16   130 Screening  Baseline     
 4      1 2019-01-17   127 (Missing)  Not baseline 
 5      1 2019-01-18   120 (Missing)  Not baseline 
 6      2 2019-01-09   145 Screening  Not baseline 
 7      2 2019-01-10    NA Screening  Not baseline 
 8      2 2019-01-11   130 Screening  Baseline     
 9      2 2019-01-12   140 (Missing)  Not baseline 
10      2 2019-01-13   120 (Missing)  Not baseline 
11      3 2019-02-09   135 Screening  Not baseline 
12      3 2019-02-10   145 Screening  Baseline     
13      3 2019-02-12    NA Screening  Not baseline 
14      3 2019-02-13   140 (Missing)  Not baseline 
15      3 2019-02-15   120 (Missing)  Not baseline

siddharthprabhu · March 6, 2020, 4:07am

The code from my previous post works fine on your new data as well (output shown below). Can you indicate where it is going wrong?

# A tibble: 15 x 5
   subjid Date        SBP Visit_Type Baseline_flag
    <dbl> <chr>     <dbl> <chr>      <lgl>        
 1      1 14-Jan-19    NA Screening  NA           
 2      1 15-Jan-19   125 Screening  NA           
 3      1 16-Jan-19   130 Screening  TRUE         
 4      1 17-Jan-19   127 NA         NA           
 5      1 18-Jan-19   120 NA         NA           
 6      2 9-Jan-19    145 Screening  NA           
 7      2 10-Jan-19    NA Screening  NA           
 8      2 11-Jan-19   130 Screening  TRUE         
 9      2 12-Jan-19   140 NA         NA           
10      2 13-Jan-19   120 NA         NA           
11      3 09-Feb-19   135 Screening  NA           
12      3 10-Feb-19   145 Screening  TRUE         
13      3 12-Feb-19    NA Screening  NA           
14      3 13-Feb-19   140 NA         NA           
15      3 15-Feb-19   120 NA         NA

dromano · March 6, 2020, 12:44pm

I think there may be two issues here: If the date is left as character, as in @siddharthprabhu' s code, the ordering will be alphabetical, rather than by date. This is fixed in @vinaychuri's and @joels's code, but it depends on which code @dstokar copied to use. The other issue is that a grouped table may be unpredictably reordered by a subsequent operation, so that slice() may not pick the intended rows. I think @joels's code should work, but here's another version to try, @dstokar:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
df <- tibble::tribble(
  ~subjid,       ~Date, ~SBP, ~Visit_Type,
  1, "15-Jan-19",  125, "Screening",
  1, "16-Jan-19",  130, "Screening",
  1, "17-Jan-19",  127,          NA,
  1, "18-Jan-19",  120,          NA,
  2,  "9-Jan-19",  145, "Screening",
  2, "10-Jan-19",  130, "Screening",
  2, "11-Jan-19",  140,          NA,
  2, "12-Jan-19",  120,          NA,
  3, "10-Feb-19",  145, "Screening",
  3, "12-Feb-19",   NA, "Screening",
  3, "13-Feb-19",  140,          NA,
  3, "15-Feb-19",  120,          NA
)

# summary version
df %>% 
  mutate(Date = dmy(Date)) %>% 
  filter(Visit_Type == 'Screening', !is.na(SBP)) %>% 
  group_by(subjid, Visit_Type) %>% 
  arrange(desc(Date)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  mutate(Visit_Type = 'Baseline')
#> # A tibble: 3 x 4
#>   subjid Date         SBP Visit_Type
#>    <dbl> <date>     <dbl> <chr>     
#> 1      1 2019-01-16   130 Baseline  
#> 2      2 2019-01-10   130 Baseline  
#> 3      3 2019-02-10   145 Baseline
 
# full table version 
df %>% 
  mutate(Date = dmy(Date)) %>% 
  group_by(subjid, Visit_Type, is.na(SBP)) %>% 
  arrange(desc(Date)) %>%
  mutate(group_order = row_number()) %>% 
  ungroup() %>% 
  mutate(
    Visit_Type = 
           if_else(
             Visit_Type == 'Screening' & !is.na(SBP) & group_order == 1,
            'Baseline', 
            Visit_Type
            )
  ) %>% 
  select(-`is.na(SBP)`, -group_order)
#> # A tibble: 12 x 4
#>    subjid Date         SBP Visit_Type
#>     <dbl> <date>     <dbl> <chr>     
#>  1      3 2019-02-15   120 <NA>      
#>  2      3 2019-02-13   140 <NA>      
#>  3      3 2019-02-12    NA Screening 
#>  4      3 2019-02-10   145 Baseline  
#>  5      1 2019-01-18   120 <NA>      
#>  6      1 2019-01-17   127 <NA>      
#>  7      1 2019-01-16   130 Baseline  
#>  8      1 2019-01-15   125 Screening 
#>  9      2 2019-01-12   120 <NA>      
#> 10      2 2019-01-11   140 <NA>      
#> 11      2 2019-01-10   130 Baseline  
#> 12      2 2019-01-09   145 Screening

^{Created on 2020-03-06 by the reprex package (v0.3.0)}

Does this work in the right way?

system · March 27, 2020, 12:44pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.