Hi Posit Community.
I reached out to the `arrow` devs, but have not received a response regarding this request.
I have an arrow table, and I want to run some basic functions such as `mean`, `max`, or `min` across multiple repeating participants using `summarize`. However, it appears that `arrow` does not currently accept the `na.rm = TRUE` argument (unless `arrow_min` does and I am missing something). If it does, I can't find it in the documentation.
Say I took the original dataset:
Participant | Rating |
---|---|
Donna | 17 |
Donna | NA |
Greg | 21 |
Greg | NA |
If these were generic `R` dataframes, either of these two calls would work (though one is deprecated):
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
group_by(Participant) |>
summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
as.data.frame()
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
group_by(Participant) |>
summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
as.data.frame()
Participant | Rating |
---|---|
Donna | 17 |
Greg | 21 |
However, when I run the same commands as an arrow table, both throw errors:
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
as_arrow_table() |>
group_by(Participant) |>
summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
as.data.frame()
Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
as_arrow_table() |>
group_by(Participant) |>
summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
as.data.frame()
Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.
And the one that does work:
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
as_arrow_table() |>
group_by(Participant) |>
summarize(across(matches("Rating"), max)) |>
as.data.frame()
returns `NA` values, which is not what I want:
Participant | Rating |
---|---|
Donna | NA |
Greg | NA |
Is there a way to pass the `na.rm = TRUE` argument to this call without having to manually drop the `NA` values for each column or row of interest I have in my data?
`arrow_max` and the similar commands also do not appear to work with `summarize`.
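For anyone reproducing this, here is a minimal sketch of two variants that might sidestep the errors above. Both rest on assumptions I have not confirmed against the documentation: that arrow's `across()` accepts formula-style lambdas (`~ max(.x, na.rm = TRUE)`) even though it rejects `\(x)` anonymous functions, and that `max()` called directly inside `summarize()` honours `na.rm = TRUE` on an arrow table. The names `ratings`, `result`, and `result_explicit` are only for illustration.

```r
library(arrow)
library(dplyr)

# Same toy data as in the examples above.
ratings <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Variant 1 (assumption): a formula-style lambda inside across(),
# instead of an anonymous function or the deprecated `...` argument.
result <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE))) |>
  as.data.frame()

# Variant 2 (assumption): name the column explicitly and pass na.rm
# straight to max(); this does not scale to many columns.
result_explicit <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(Rating = max(Rating, na.rm = TRUE)) |>
  as.data.frame()
```

If the first variant behaves as assumed, it would keep the `across()` column selection, which is what I need for many columns. The second is only practical when there are a handful of columns to summarize.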
Thank you to all in advance.