Hi Posit users.
I submitted an issue about this to the arrow
devs:
opened 01:23PM - 19 Aug 24 UTC
Type: bug
Component: R
### Describe the bug, including details regarding any error messages, version, a… nd platform.
Hi Arrow Devs.
Not sure if this would be more of a usage question or a bug, but it certainly seemed to behave like the latter.
I have a large amount of data that needs to be queried in a specific way using a combination of hard (exact matches) and soft (begins with) string matches based on another condition (e.g. value of "class" in another column).
I was recently able to get this working via `if_else` with many other singular instances where the queries were smaller, but with this larger dataset I need to recruit `case_when` for a set of larger, conditional mutations based on which "database" the query belongs to.
However, every time I do this, the package seems to throw an error that it cannot be run in `arrow` and needs to be pulled into R:
```
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in( identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ EndDate, database == "primary" ~ as.Date(as.integer(BeginDate) + 30), .default = Foundate))
Error: Expression case_when((((arrow_is_in(identity_string, options = list(value_set = Array$create(HardA), ... not supported in Arrow
Call collect() first to pull data into R
```
I assumed this was some kind of issue with doing a date conversion in `arrow`, which I realize does not have hard-mapped `tidyverse` commands for doing mathematical operations on dates (e.g. I need to add 30 days to BeginDate and the like, hence why I used Foundate as an imputed placeholder that would not be possible to remove later.
However, oddly enough, every single one of these other options seem to work:
```
#Works (adding to date permitted, no case_when)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else(database == "primary", as.Date(as.integer(BeginDate) + 30), EndDate)) |> as_arrow_table()
#Works (no case_when, no math performed on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in(identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, NA)) |> as_arrow_table()
#Works (no case when, adding to date, permits operations on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else(database == "primary", as.Date(as.integer(BeginDate) + 30), Foundate)) |> as_arrow_table()
#This Works(HAS case_when, but value for met criteria is an integer, not a date)
> FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in( identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ 1, database == "primary" ~ 0, .default = 2)) |> as_arrow_table()
#This Works (no case_when, saves as date with no math)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in(identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, Foundate)) |> as_arrow_table()
```
Am I facing some sort of syntax issue or is there a limitation when attempting to set `case_when` conditions with regards to dates?
Thank you in advance for any suggestions.
### Component(s)
R
But this is not dissimilar to something I asked in one of my previous questions and I found it interesting that it was behaving one way with if_else
and another with case_when
and was wondering if anyone had any advice or had seen this before.
I have a large amount of data that needs to be queried in a specific way using a combination of hard (exact matches) and soft (begins with) string matches based on another condition (e.g. value of "class" in another column).
I was recently able to get this working via if_else
with many other singular instances where the queries were smaller, but with this larger dataset I need to recruit case_when
for a set of larger, conditional mutations based on which "database" the query belongs to.
However, every time I do this, the package seems to throw an error that it cannot be run in arrow
and needs to be pulled into R:
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in( identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ EndDate, database == "primary" ~ as.Date(as.integer(BeginDate) + 30), .default = Foundate))
Error: Expression case_when((((arrow_is_in(identity_string, options = list(value_set = Array$create(HardA), ... not supported in Arrow
Call collect() first to pull data into R
I assumed this was some kind of issue with doing a date conversion in arrow
, which I realize does not have hard-mapped tidyverse
commands for doing mathematical operations on dates (e.g. I need to add 30 days to BeginDate and the like, hence why I used Foundate as an imputed placeholder that would not be possible to remove later.
However, oddly enough, every single one of these other options seem to work:
#Works (adding to date permitted, no case_when)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else(database == "primary", as.Date(as.integer(BeginDate) + 30), EndDate)) |> as_arrow_table()
#Works (no case_when, no math performed on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in(identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, NA)) |> as_arrow_table()
#Works (no case when, adding to date, permits operations on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else(database == "primary", as.Date(as.integer(BeginDate) + 30), Foundate)) |> as_arrow_table()
#This Works(HAS case_when, but value for met criteria is an integer, not a date)
> FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in( identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ 1, database == "primary" ~ 0, .default = 2)) |> as_arrow_table()
#This Works (no case_when, saves as date with no math)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else((((class == "A" & arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | (class == "B" & arrow_is_in(identity_string, options = list(value_set = Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & arrow_is_in(identity_string, options = list(value_set = Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, Foundate)) |> as_arrow_table()
Am I facing some sort of syntax issue or is there a limitation when attempting to set case_when
conditions with regards to dates?
Based on a previous thread:
arrow package has a function to solve this. use to_duckdb instead of as_arrow_table.
For you code, just change this part, all thinks will be ok.
#------------Arrow duck -----------------
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
+ as.data.frame()
Participant Rating
1 Greg 21
2 Donna 17
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), max,…
I think it might be another SQL limitation?
Thank you in advance for any suggestions!