where x ranges from 0 to 65 inclusive (i.e. there's 66 columns of f.41262.0.x, for example f.41262.0.0, f.41262.0.1,...). These are all different dates of various diagnoses and so I would like to search for the closest date to each one in bd_extracted_rows_disease_date_of_death and then subtract like so: bd_extracted_rows_disease_date_of_death - bd_extracted_rows_disease_date_of_in_patient_diagnosis to see how long each participant survived. This works if I just take the first column, f.41262.0.0 but I can't trust that these are the correct dates for this particular disease so I need to find the closest date to the date of death and then take the difference.
Does anyone have any ideas of which functions I might use and how I might implement this? Here's the minimal working example (which works) when I use column f.41262.0.0:
If I understand you well, this the attached codecould help.
I create a file for 2 id's (persons) with 2 possible diseases (but note that the code is not depended on these particular numbers.
The first two steps only serve to create a file simular to yours. So assume df2 is your input file.
In variable diseases I gather the names of the disease variables.
In df3 I have calculated for all diseases the difference in days with the 'dying' date (and set this difference to 99999 when there is not date for the disease).
In df4 I have calculated the disease that has a date most near the dying date and the difference in days.
I think this is what need and otherwise you will have some inspiration.
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
df1 <- tibble::tribble(
~id, ~f.40000.0.0, ~f.41262.0.0, ~f.41262.0.1,
1L, "01/01/2014", "01/01/2013" , NA,
2L, "01/01/2014", "01/01/2013" , "01/06/2013"
)
df1
#> # A tibble: 2 x 4
#> id f.40000.0.0 f.41262.0.0 f.41262.0.1
#> <int> <chr> <chr> <chr>
#> 1 1 01/01/2014 01/01/2013 <NA>
#> 2 2 01/01/2014 01/01/2013 01/06/2013
df2 =df1 %>%
mutate(
across(
.cols = starts_with("f.4"),
.fns = ~ lubridate::dmy(.x)
)
)
df2
#> # A tibble: 2 x 4
#> id f.40000.0.0 f.41262.0.0 f.41262.0.1
#> <int> <date> <date> <date>
#> 1 1 2014-01-01 2013-01-01 NA
#> 2 2 2014-01-01 2013-01-01 2013-06-01
diseases = setdiff(names(df1),c('id','f.40000.0.0'))
df3 = df2 %>%
mutate(
across(
.cols = any_of(diseases),
.fns = function(x,y) ifelse(is.na(x),99999,y-x),
df2$f.40000.0.0
)
)
df3
#> # A tibble: 2 x 4
#> id f.40000.0.0 f.41262.0.0 f.41262.0.1
#> <int> <date> <dbl> <dbl>
#> 1 1 2014-01-01 365 99999
#> 2 2 2014-01-01 365 214
min_i = function(...) {
a = unlist(list(...))
order(a)[1]
}
min_v = function(...) {
a = unlist(list(...))
a[order(a)[1]]
}
df4 = df3 %>%
rowwise() %>%
mutate(
min_ix = min_i (c_across(cols = any_of(diseases))),
min_ix = diseases[min_ix],
min_d = min_v (c_across(cols = any_of(diseases))),
)
df4
#> # A tibble: 2 x 6
#> # Rowwise:
#> id f.40000.0.0 f.41262.0.0 f.41262.0.1 min_ix min_d
#> <int> <date> <dbl> <dbl> <chr> <dbl>
#> 1 1 2014-01-01 365 99999 f.41262.0.0 365
#> 2 2 2014-01-01 365 214 f.41262.0.1 214