[Arrow] Filtering an arrow table between a range of dates

Hi Posit Users.

I hope this is a simple questions that I just happened to not be able to find on the arrow documentation of available functions:

But would anyone happen to know how to filter between a date interval in an arrow table?

I have noticed that the lubridate is not yet arrow compatable:

> data |> filter(ADate %within% interval(ymd("2000-01-01"), ymd("2010-01-01")))
Error: Filter expression not supported for Arrow Datasets: ADate %within% interval(ymd(Colum), ymd("2010-01-01"))
Call collect() first to pull data into R.
In addition: Warning message:
tz(): Don't know how to compute timezone for object of class Expression/ArrowObject/R6; returning "UTC".

But I know arrow has functions processes such as: year, weeks_between, years_between, day_of_week ect.

Similarly, could this be done with a as.Date/Posixltformat?

Would appreciate any suggestions!

Thank you in advance.

Hi @TPDeRamus,

I'm not familiar with the %within% function, is it part of the arrow package?

The following uses base::as.Date() and dplyr::between(), which seems to do a similar thing?

data |> 
  filter(between(ADate, as.Date('2000-01-01'), as.Date('2010-01-01')))

alternatively, specifying upper and lower date limits like this:

data |> 
  filter(
    ADate >= as.Date('2000-01-01'), 
    ADate <= as.Date('2010-01-01')
  )
1 Like

Hi again,

I've just learned the %within% function is part of the lubridate package.

I've tried using it with a simple arrow dataset and get the following error message:

Error: Filter expression not supported for Arrow Datasets: date %within% int
Call collect() first to pull data into R.

... it looks like %within% isn't currently supported by arrow datasets.

Here is a reference for supported functions; you can access this via R help if you loaded the arrow library and do
?acero

on the web its at Functions available in Arrow dplyr queries — acero • Arrow R Package (apache.org)

so dplyr, between() is available

1 Like

Thanks @craig.parylo! That's exactly what I needed!

@nirgrahamuk, I am aware there are limits of what arrow can interpret in dplyr, but there are also so interesting little parsing nuances I've slowly been discovering.

Related to this, would anyone happen to know how to subtract from dates in arrow?

It keeps throwing:

Error in `compute.arrow_dplyr_query()`:
! NotImplemented: Function 'subtract_checked' has no kernel matching input types (timestamp[s], float)

And while I have seen a lot of issues in their Git Repo about this, I haven't seen a lot in the documentation about how to perform it.

its convoluted, but you seemingly can cast a Date to an integer, subtract 1 then cast back to a date.

library(tidyverse)
library(arrow)
df <- data.frame(
  ID = 1:10,
  Date = seq(as.Date("2024-01-01"), by = "month", length.out = 10),
  Value = rnorm(10, 50, 10)
)

df

tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))

write_dataset(df, tf, partitioning = "ID")


arrow_df <- open_dataset(tf)

arrow_df |> 
  filter(
    Date >= as.Date('2024-03-01'), 
    Date <= as.Date('2024-06-01')
  ) |> mutate(
    prev_dat = as.Date(as.integer(Date)-1))|> collect()
2 Likes

Wow.

Yeah, that is convoluted.

But it works beautifully and is considerably faster than what I was doing!

Thanks so much @nirgrahamuk!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.