I reached out to the arrow devs, but have not received a response regarding this request.
I have an arrow table, and I want to run some basic functions such as mean , max , or min across multiple repeating participants using summarize, but it appears that arrow does not currently accept the na.rm = TRUE argument (unless arrow_min does and I am missing something), or that if it does, I can't seem to find it in the documentation.
Say I took the original dataset:
Participant
Rating
Donna
17
Donna
NA
Greg
21
Greg
NA
If these were generic R dataframes, either of these two calls would work (though one is deprecated):
However, when I run the same commands as an arrow table, both throw errors:
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
as_arrow_table() |>
group_by(Participant) |>
summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
as.data.frame()
Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.
data.frame(
Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
Rating = c(21, NA, 17, NA)
) |>
as_arrow_table() |>
group_by(Participant) |>
summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
as.data.frame()
Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.
Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?
arrow_max and the similar commands also do not appear to work with summarize.
Yeah I figured this was 90% of the issue, but I was hoping there was something here:
or here
That I'm just missing, as the latter seems to imply this should work.
In the actual workflow, I'm selecting my columns with across and matches, but willing to forgo that altogether just to get this na.rm argument working.
This might be wise. I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.
With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?
Glad to see you got a solution here. For a bit of extra context, the problem likely lies in how the arrow code is converting the across() call and perhaps missing something there. Another alternative workaround (but this is inelegant IMO) is:
I implemented most of the code in the arrow version of across() and will take a look at fixing this at some point, though may be a while as life is pretty busy right now! Thanks everyone who jumped in here with ideas and a solution
To sum up, as far as I tried in some project play with large data. Arrow has very good performance to store and extract raw data however it has limited sql function implementation. For duckdb, it is high performance computation library which support sql-92, sql-2003(part), sql-2011(part) which use arrow as low level storage engine.
Use arrow + duckdb, enjoy the high performance and overcome such sql compatible issue.