Filter Between Date Ranges by Individual Participant

Hi Experts.

Apologies for my brevity but I'm hoping this is a relatively simple answer.

I would like to use dplyr to filter a (very large) dataset that I'm accessing via arrow down to the participants listed in another dataframe, like the following:

output <- arrow_mapping |> filter(PatientId %in% DataframeOfInterest$PatientId) |> select(PatientId, Date, VOI1, VOI2, VOI3) |> collect()

However, I would also like to apply a filter that ONLY keeps the instances where a Date variable in the dataframe falls between two dates that are unique to each participant in the other dataframe.

I didn't see much information in the documentation or on Stack Overflow about how to do this. Would you be able to suggest how I could perform this kind of filtering at the PatientId level?

I've also made a Stack Overflow question with some tables to make this a little easier to interpret:

Please feel free to refer to it if anything is unclear.

Just a quick heads-up: the code for building the "Target" table is incomplete.

Likely a copy-pasting error on my part.


Have you tried using the inner_join() function?
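As a rough sketch (assuming the per-participant bounds in DataframeOfInterest are in columns I'm calling StartDate and EndDate here — your actual column names will differ), it could look like this:

```r
library(dplyr)

# Sketch only: join the per-participant date bounds in, then keep rows whose
# Date falls inside that participant's window. StartDate/EndDate are
# hypothetical column names standing in for your two per-participant dates.
output <- arrow_mapping |>
  inner_join(DataframeOfInterest, by = "PatientId") |>
  filter(Date >= StartDate, Date <= EndDate) |>
  select(PatientId, Date, VOI1, VOI2, VOI3) |>
  collect()
```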

Trying something like that now.

The actual data has 300,000k vs 2,000k rows and I was hoping to avoid doing a join and put that all into memory.

With the right dplyr syntax, I should be able to do it without overburdening my memory too much but waiting to see how that turns out.

It shouldn't be necessary to do the join and then filter — have you seen this?

This does the joining and filtering at the same time, so should be a little less taxing on memory:

# original data for tables "Reference" and "Target"
Reference <- 
structure(list(ParticipantId = 10001:10003, DateA = c("3/12/2013", 
"5/15/2022", "9/20/2022")), class = "data.frame", row.names = c(NA, -3L))

Target <- 
structure(list(ParticipantId = c(10001L, 10001L, 10001L, 10001L, 
10002L, 10002L, 10002L, 10003L, 10003L), Date1 = c("1/2/2010", 
"1/2/2016", "1/2/2019", "1/2/2021", "1/2/2016", "1/2/2019", "1/2/2021", 
"1/2/2019", "1/2/2021"), Date2 = c("1/2/2015", "1/2/2018", "1/2/2020", 
"1/2/2023", "1/2/2018", "1/2/2020", "1/2/2023", "1/2/2020", "1/2/2023"
)), class = "data.frame", row.names = c(NA, -9L))

library(dplyr)
library(readr)  # for parse_date()

Reference |> 
  mutate(DateA = parse_date(DateA, format = '%m/%d/%Y')) |> 
  inner_join(
    Target |> 
      mutate(across(c(Date1, Date2), \(d) parse_date(d, format = '%m/%d/%Y'))),
    by = join_by(ParticipantId, between(DateA, Date1, Date2))
  )
#>   ParticipantId      DateA      Date1      Date2
#> 1         10001 2013-03-12 2010-01-02 2015-01-02
#> 2         10002 2022-05-15 2021-01-02 2023-01-02
#> 3         10003 2022-09-20 2021-01-02 2023-01-02

Created on 2024-05-23 with reprex v2.0.2

I think this will work, but I've run into an unrelated issue:

Error in `compute.arrow_dplyr_query()`:
! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(ParticipantId) of type large_string and FieldRef.Name(ParticipantId) of type string
Run `rlang::last_trace()` to see where the error occurred.

Would you happen to know how to convert a string in a tibble/dataframe to a large_string?
I'd like to avoid editing multiple schema if possible.
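One possibility (I'm not certain this is the cleanest route) is to cast on the Arrow side of the query rather than editing schemas, assuming the arrow package's cast() binding is available inside dplyr verbs:

```r
library(arrow)
library(dplyr)

# Sketch only: cast the join key on the Arrow query so both sides of the
# join share the same Arrow type (large_utf8 is Arrow's large_string).
arrow_mapping <- arrow_mapping |>
  mutate(ParticipantId = cast(ParticipantId, large_utf8()))
```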

It's hard to know without having the code and data in hand, and if it's unrelated, it's best to post as a new topic.

So I resolved the type-mismatch issue.

But if you would, could you share your solution without the date conversion?

Currently the following seems to work (as the data is already dates on my end):

Output <-
  Target |> 
  left_join(Reference, by = join_by(ParticipantId)) |>
  filter(between(DateA, Date1, Date2)) |>
  collect()

But the one above fails from what I assume is a syntax error on my part:

Output <-
  Reference |> 
  left_join(
    Target, by = join_by(ParticipantId, between(DateA, Date1, Date2))
  ) |>
  collect()
Error in `auto_copy()`:
! `x` and `y` must share the same src.
ℹ `x` is a <tbl_df/tbl/data.frame> object.
ℹ `y` is a <FileSystemDataset/Dataset/ArrowObject/R6> object.
ℹ Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).
Run `rlang::last_trace()` to see where the error occurred.

Actually I think that error is a dimensional issue between Target and Reference on my side because it works just fine when I flip it.

However, it does throw:

Error in `handle_join_by()`:
! Inequality conditions and helper functions are not supported in `join_by()` expressions.
Run `rlang::last_trace()` to see where the error occurred.

When I run:

Output <-
  Reference |> 
  left_join(
    Target, by = join_by(PatientId, between(DateA, Date1, Date2))
  ) |>
  collect()

Do you mean this works?

Output <-
  Target |> 
  left_join(
    Reference, by = join_by(PatientId, between(DateA, Date1, Date2))
  ) |>
  collect()

I don't know if this is likely to be of any use as it still requires a join but using {data.table} might reduce memory requirements and should give you more speed.


library(data.table)
library(lubridate)  # for mdy()

target <- fread("target.csv")
ref <- fread("ref.csv")

target[, c("Date1", "Date2") := lapply(.SD, mdy),
       .SDcols = c("Date1", "Date2")][, ParticipantId := as.character(ParticipantId)][]

ref[, DateA := mdy(DateA)][ , ParticipantId := as.character(ParticipantId)]

setkey(target, ParticipantId)
setkey(ref, ParticipantId)

DT1 <- ref[target]

DT1[DateA >= Date1 & DateA <= Date2 ]

In the real data (which I would provide if I could), the Target is orders of magnitude higher than the reference.

This is a little surprising given what I understand about the syntax of between(): could you post the output of the following?

Target |> 
  select(contains('Date')) |> 
  glimpse()

Reference |> 
  select(contains('Date')) |> 
  glimpse()

Won't do much good I'm afraid given the nature of the data.

I think I can work with some of the parts of the solutions here.

Please don't trouble yourself any further than necessary.

The only thing that occurs to me is that in the Stack Exchange post (which is what my suggestion applies to), maybe you switched which columns were supposed to be in the Target and Reference tables? You would definitely get an error with the code I suggested if the roles of Target and Reference were swapped.

There is not a good way to do this using only filter(), because your constraint values come from another dataframe. Every approach I can think of requires making a lookup dataframe containing PatientId, minDate, and maxDate from the first dataframe, joining that in, and then filtering against those values. A reproducible example of this approach is below. I am using the powerful combo of tidyr::nest() and purrr::map(). nest() splits the dataframe by PatientId to create a column data, which is a list of dataframes, one per PatientId. map() then applies filter() to each of those dataframes, producing a new list column filteredDFs. unnest() then expands the filtered dataframes back out of filteredDFs.




library(dplyr)
library(tidyr)
library(purrr)

## Dummy Dataframe of Interest

dataFrameOfInterest <- data.frame(
    PatientId = rep(c(9001, 9002, 9003, 9004, 9005), 500),
    Date = sample(seq.Date(as.Date('2022-01-01'), as.Date('2024-04-20'), 'day'), 
        2500, replace = TRUE)
)

## Dummy Arrow Mapping

arrow_mapping <- data.frame(
    PatientId = rep(c(9001, 9002, 9003, 9004, 9005), 100000),
    Date = sample(seq.Date(as.Date('2020-01-01'), as.Date('2024-04-20'), 'day'), 
        500000, replace = TRUE),
    VOI1 = sample(LETTERS, 500000, replace = TRUE),
    VOI2 = sample(letters, 500000, replace = TRUE),
    VOI3 = sample(1:100000, 500000, replace = TRUE)
)
### 1) Create a lookup of unique ID and date ranges

lu_dfoi <- dataFrameOfInterest %>% 
    group_by(PatientId) %>% 
    summarise(
        minDate = min(Date, na.rm = TRUE),
        maxDate = max(Date, na.rm = TRUE)
    )

### 2) Join Min/Max dates by PatientId, getting ID-specific date range
### Nest by PatientId, map through each dataframe in filteredDFs, unnest

output <- arrow_mapping %>% 
    left_join(., lu_dfoi, by = 'PatientId') %>% 
    tidyr::nest(data = !PatientId) %>% 
    mutate(
        filteredDFs = purrr::map(seq_along(data), ~ 
            filter(data[[.x]], Date >= minDate, Date <= maxDate))
    ) %>% 
    unnest(cols = c(filteredDFs))

