Filter Between Date Ranges by Individual Participant

TPDeRamus · May 23, 2024, 2:05pm

Hi Experts.

Apologies for my brevity but I'm hoping this is a relatively simple answer.

I would like to use dplyr to filter a (very large) dataset I'm calling via arrow to a list of participants I have in one dataframe to those I have in another, like the following:

output <- arrow_mapping |> filter(PatientId %in% DataframeOfInterest$PatientId) |> select(PatientId,Date,VOI1,VOI2,VOI3) |> collect()

However, I would also like to apply a filter that ONLY keeps the instances where a Date variable in the dataframe falls between two dates that are unique to each participant in the other dataframe.

I didn't see a lot of information in the documentation or Stack Overflow on how to do this, would you be able to provide a suggestion as to how I could perform this kind of filtering on a PatientId level?

TPDeRamus · May 23, 2024, 5:23pm

I've also made a Stackoverflow Question with some tables to make this a little easier to interpret:

Please feel free to refer to it if anything is unclear.

dromano · May 23, 2024, 6:43pm

Just a quick heads-up: The code for building the "Target" table is incomplete.

TPDeRamus · May 23, 2024, 6:48pm

Apologies.

Likely a copy pasting error on my part.

dromano · May 23, 2024, 7:01pm

Have you tried using the inner_join() function?

TPDeRamus · May 23, 2024, 7:13pm

Trying something like that now.

The actual data has 300,000k vs 2,000k rows and I was hoping to avoid doing a join and put that all into memory.

With the right dplyr syntax, I should be able to do it without overburdening my memory too much but waiting to see how that turns out.

dromano · May 23, 2024, 7:21pm

It shouldn't be necessary to do the join and then filter — have you seen this?

dromano · May 23, 2024, 7:39pm

This does the joining and filtering at the same time, so should be a little less taxing on memory:

original data for tables "Reference" and "Target"

Reference <- 
structure(list(ParticipantId = 10001:10003, DateA = c("3/12/2013", 
"5/15/2022", "9/20/2022")), class = "data.frame", row.names = c(NA, 
-3L))

Target <- 
structure(list(ParticipantId = c(10001L, 10001L, 10001L, 10001L, 
10002L, 10002L, 10002L, 10003L, 10003L), Date1 = c("1/2/2010", 
"1/2/2016", "1/2/2019", "1/2/2021", "1/2/2016", "1/2/2019", "1/2/2021", 
"1/2/2019", "1/2/2021"), Date2 = c("1/2/2015", "1/2/2018", "1/2/2020", 
"1/2/2023", "1/2/2018", "1/2/2020", "1/2/2023", "1/2/2020", "1/2/2023"
)), class = "data.frame", row.names = c(NA, -9L))

library(tidyverse)

Reference |> 
  mutate(DateA = parse_date(DateA, format =  '%m/%d/%Y')) |> 
  inner_join(
    Target |> 
      mutate(
        across(
          contains('Date'),
        \(d) d |> parse_date(format =  '%m/%d/%Y')
        )
      ),
    join_by(ParticipantId, between(DateA, Date1, Date2))
  )
#>   ParticipantId      DateA      Date1      Date2
#> 1         10001 2013-03-12 2010-01-02 2015-01-02
#> 2         10002 2022-05-15 2021-01-02 2023-01-02
#> 3         10003 2022-09-20 2021-01-02 2023-01-02

^{Created on 2024-05-23 with reprex v2.0.2}

TPDeRamus · May 24, 2024, 1:07pm

I think this will work but unrelated issue:

Error in `compute.arrow_dplyr_query()`:
! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(ParticipantId) of type large_string and FieldRef.Name(ParticipantId) of type string
Run `rlang::last_trace()` to see where the error occurred.

Would you happen to know how to convert a string in a tibble/dataframe to a large_string?
I'd like to avoid editing multiple schema if possible.

dromano · May 24, 2024, 1:22pm

It's hard to know without having the code and data in hand, and if it's unrelated, it's best to post as a new topic.

TPDeRamus · May 24, 2024, 3:08pm

dromano:

Reference |> 
  mutate(DateA = parse_date(DateA, format =  '%m/%d/%Y')) |> 
  inner_join(
    Target |> 
      mutate(
        across(
          contains('Date'),
        \(d) d |> parse_date(format =  '%m/%d/%Y')
        )
      ),
    join_by(ParticipantId, between(DateA, Date1, Date2))
  )

So I resolved the listing issue.

But if you would, could you share your solution without the date conversion?

Currently the following seems to work (as the data is already dates on my end):

Output <-
  Target |> left_join(Reference, by = join_by(ParticipantId)) |>
  filter(between(DateA, Date1, Date2)) |>
  collect()

But the one above fails from what I assume is a syntax error on my part:

Output <-
Reference |> 
  inner_join(
    Target, join_by(PaticipantId, between(DateA, Date1, Date2))
  ) |>
  collect()
Error in `auto_copy()`:
! `x` and `y` must share the same src.
ℹ `x` is a <tbl_df/tbl/data.frame> object.
ℹ `y` is a <FileSystemDataset/Dataset/ArrowObject/R6> object.
ℹ Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).
Run `rlang::last_trace()` to see where the error occurred.

TPDeRamus · May 24, 2024, 3:44pm

Actually I think that error is a dimensional issue between Target and Reference on my side because it works just fine when I flip it.

However, it does throw:

Error in `handle_join_by()`:
! Inequality conditions and helper functions are not supported in `join_by()` expressions.
Run `rlang::last_trace()` to see where the error occurred.

When I run:

Output <-
Reference |> 
  inner_join(
    Target, by = join_by(PatientId, between(DateA, Date1, Date2))
  ) |>
  collect()

dromano · May 24, 2024, 3:50pm

Do you mean this works?

Output <-
  Target |> 
  inner_join(
    Reference, by = join_by(PatientId, between(DateA, Date1, Date2))
  ) |>
  collect()

jrkrideau · May 24, 2024, 4:37pm

I don't know if this is likely to be of any use as it still requires a join but using {data.table} might reduce memory requirements and should give you more speed.

suppressMessages(library(lubridate)) 
suppressMessages(library(data.table)) 

target <- fread("target.csv")
ref <- fread("ref.csv")

target[, c("Date1", "Date2") := lapply(.SD, mdy),
       .SDcols = c("Date1", "Date2")][, ParticipantId := as.character(ParticipantId)][]

ref[, DateA := mdy(DateA)][ , ParticipantId := as.character(ParticipantId)]

setkey(target, ParticipantId)
setkey(ref, ParticipantId)

DT1 <- ref[target]

DT1[DateA >= Date1 & DateA <= Date2 ]

TPDeRamus · May 24, 2024, 7:53pm

Correct.
In the real data (which I would provide if I could), the Target is orders of magnitude higher than the reference.

dromano · May 24, 2024, 8:12pm

This is a little surprising given what I understand about the syntax of between(): Could you post the output the following?

Target |> 
  select(contains('Date')) |> 
  names()

Reference |> 
  select(contains('Date')) |> 
  names()

TPDeRamus · May 24, 2024, 8:19pm

Won't do much good I'm afraid given the nature of the data.

I think I can work with some of the parts of the solutions here.

Please don't trouble yourself any further than necessary.

dromano · May 24, 2024, 9:06pm

The only thing that occurs to me is that in the Stack Exchange post (which is what my suggestion applies to) maybe you switched which columns were supposed in the Target and Reference tables? Because you would definitely get an error with code I suggested if you the roles of Target and Reference are swapped.

peernisse · May 26, 2024, 11:12pm

There is not a good way to do this using only filter(), because your constraint values are from another DF. There are a couple approaches to accomplish this, each I can think of requires making a lookup dataframe containing PatientId, minDate, maxDate from the first DF, joining that in, and then filtering against those values. A reproducible example of this approach is below. I am using the powerful combo of tidyr::nest() and purrr::map(). nest() splits the DF by PatientId to create a column data which is a list of DFs for each PatientId. map() then applies filter() on each of those DFs into a new list column 'filteredDFs'. unnest() reveals the filtered dataframe from the list in filteredDFs.

library(dplyr)
library(purrr)
library(tidyr)

# MAKE TEST DATA ----

set.seed(42)

## Dummy Dataframe of Interest

dataFrameOfInterest <- data.frame(
    PatientId = rep(c(90011,9002,9003,900444,9005), 500),
    Date = sample(seq.Date(as.Date('2022-01-01'),as.Date('2024-04-20'), 'day'), 
        2500, replace = TRUE
    )
)

## Dummy Arrow Mapping

arrow_mapping <- data.frame(
    PatientId = rep(c(9001,9002,9003,9004,9005), 100000),
    Date = sample(seq.Date(as.Date('2020-01-01'), as.Date('2024-04-20'), 'day'), 
        500000, replace = TRUE
    ),
    VOI1 = sample(LETTERS, 500000, replace = TRUE),
    VOI2 = sample(letters, 500000, replace = TRUE),
    VOI3 = sample(1:100000, 500000, replace = TRUE)
)

### 1) Create a lookup of unique ID and date ranges

lu_dfoi <- dataFrameOfInterest %>% 
    group_by(PatientId) %>% 
    summarize(
        minDate = min(Date, na.rm = TRUE),
        maxDate = max(Date, na.rm = TRUE)
    )

### 2) Join Min/Max dates by PatientId, getting ID-specific date range
### Nest by PatientId, map through each dataframe in filteredDFs, unnest

output <- arrow_mapping %>% 
    left_join(., lu_dfoi, by = 'PatientId') %>% 
    tidyr::nest(data = !PatientId) %>% 
    mutate(
        filteredDFs = purrr::map(seq_along(data), ~ 
            filter(data[[.x]], Date >= minDate, Date <= maxDate)
        )
    ) %>% 
    unnest(cols = c(filteredDFs)) %>% 
    select(PatientId,Date,VOI1,VOI2,VOI3)

Cheers

system · August 24, 2024, 11:12pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.