[Arrow with tidyverse] Calling min/max/mean with summarize on arrow tables

TPDeRamus · July 11, 2024, 12:37pm

Hi Posit Community.

I reached out to the arrow devs, but have not received a response regarding this request.

github.com/apache/arrow

[R] `dplyr` `summarize` commands do not accept `na.rm` arguments

opened 01:46PM - 10 Jul 24 UTC

TPDeramus

Component: R Type: usage

### Describe the usage question you have. Please include as many useful details …as possible. Hi Arrow devs. I wanted to ask about something I noticed about using the column-wise operators with `dplyr` in `arrow` tables. If I had an arrow table, and I wanted to run a basic function such as `mean`, `max`, or `min` using `summarize`, it appears that `arrow` does not currently accept the `na.rm = TRUE` argument, or that if it does, I can't seem to find it in the documentation. Say I took the original dataset: Producing: | Participant | Rating | | ------------ | -------- | | Donna | 17 | | Donna | NA | | Greg | 21 | | Greg | NA | For example, if these were generic `R` dataframes, either of these two calls would work (though one is deprecated): ``` data.frame( Participant = c('Greg', 'Greg', 'Donna', 'Donna'), Rating = c(21, NA, 17, NA) ) |> group_by(Participant) |> summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |> as.data.frame() data.frame( Participant = c('Greg', 'Greg', 'Donna', 'Donna'), Rating = c(21, NA, 17, NA) ) |> group_by(Participant) |> summarize(across(matches("Rating"), max, na.rm = TRUE)) |> as.data.frame() ``` Producing: | Participant | Rating | | ------------ | -------- | | Donna | 17 | | Greg | 21 | However, when I run the same commands as an arrow table, both throw errors: ``` data.frame( Participant = c('Greg', 'Greg', 'Donna', 'Donna'), Rating = c(21, NA, 17, NA) ) |> as_arrow_table() |> group_by(Participant) |> summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |> as.data.frame() Error in `across_setup()`: ! Anonymous functions are not yet supported in Arrow Run `rlang::last_trace()` to see where the error occurred. data.frame( Participant = c('Greg', 'Greg', 'Donna', 'Donna'), Rating = c(21, NA, 17, NA) ) |> as_arrow_table() |> group_by(Participant) |> summarize(across(matches("Rating"), max, na.rm = TRUE)) |> as.data.frame() Error in `expand_across()`: ! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow Run `rlang::last_trace()` to see where the error occurred. ``` And the one that does work: ``` data.frame( Participant = c('Greg', 'Greg', 'Donna', 'Donna'), Rating = c(21, NA, 17, NA) ) |> as_arrow_table() |> group_by(Participant) |> summarize(across(matches("Rating"), max)) |> as.data.frame() ``` Returns `NA` values that are not what I want: | Participant | Rating | | ------------ | -------- | | Donna | NA | | Greg | NA | Is there a way to pass the `na.rm = TRUE` argument to this call without having to manually drop the `NA` values for each column or row of interest I have in my data? ### Component(s) R

I have an arrow table, and I want to run some basic functions such as mean , max , or min across multiple repeating participants using summarize, but it appears that arrow does not currently accept the na.rm = TRUE argument (unless arrow_min does and I am missing something), or that if it does, I can't seem to find it in the documentation.

Say I took the original dataset:

Participant	Rating
Donna	17
Donna	NA
Greg	21
Greg	NA

If these were generic R dataframes, either of these two calls would work (though one is deprecated):

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Participant	Rating
Donna	17
Greg	21

However, when I run the same commands as an arrow table, both throw errors:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

And the one that does work:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()

Returns NA values that are not what I want:

Participant	Rating
Donna	NA
Greg	NA

Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?

arrow_max and the similar commands also do not appear to work with summarize.

Thank you to all in advance.

dromano · July 11, 2024, 12:49pm

As a temporary hack, you could try defining your own my_max() function and call that within your original code in place of your anonymous function.

dromano · July 11, 2024, 12:59pm

Just saw this:

here:

which may be the issue.

TPDeRamus · July 11, 2024, 1:08pm

Yeah I figured this was 90% of the issue, but I was hoping there was something here:

or here

That I'm just missing, as the latter seems to imply this should work.

In the actual workflow, I'm selecting my columns with across and matches, but willing to forgo that altogether just to get this na.rm argument working.

dromano · July 11, 2024, 1:13pm

Does you code work if you replace the across() column selection with a single column of interest?

nirgrahamuk · July 11, 2024, 1:40pm

Maybe it can be done by combining the arrow drop_null functionality with the sum they provide.

keithn · July 11, 2024, 2:27pm

dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17

dromano · July 11, 2024, 2:44pm

Very interesting, like it's relying on an older version of R. Might be worth a check with the developers.

keithn · July 11, 2024, 2:53pm

This might be wise. I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.

With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?

TPDeRamus · July 11, 2024, 8:12pm

What @dromano said.

Interesting this method works.

Saves me many hours of calculations as well!

Thanks so much!

I'll pass this on to the Arrow devs in that same thread.

thisisnic · July 13, 2024, 12:15pm

Glad to see you got a solution here. For a bit of extra context, the problem likely lies in how the arrow code is converting the across() call and perhaps missing something there. Another alternative workaround (but this is inelegant IMO) is:

max_na <- function(x) max(x, na.rm = TRUE)

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max_na)) |>
  as.data.frame()

I implemented most of the code in the arrow version of across() and will take a look at fixing this at some point, though may be a while as life is pretty busy right now! Thanks everyone who jumped in here with ideas and a solution

Yong_Wang · July 15, 2024, 6:27am

arrow package has a function to solve this. use to_duckdb instead of as_arrow_table.

For you code, just change this part, all thinks will be ok.


#------------Arrow duck -----------------
> data.frame(
+   Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+   Rating = c(21, NA, 17, NA)
+ ) |>
+   to_duckdb() |>
+   group_by(Participant) |>
+   summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
+   as.data.frame()
  Participant Rating
1        Greg     21
2       Donna     17
> data.frame(
+   Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+   Rating = c(21, NA, 17, NA)
+ ) |>
+   to_duckdb() |>
+   group_by(Participant) |>
+   summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
+   as.data.frame()
  Participant Rating
1       Donna     17
2        Greg     21
警告信息:
The `...` argument of `across()` is deprecated as of dbplyr 2.3.0.
ℹ Supply arguments directly to `.fns` through a lambda instead.

# Previously across(a:b, mean, na.rm = TRUE)

# Now across(a:b, ~mean(.x, na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated

my arrow, duckdb and dplyr version for reference:

packageVersion('dplyr')
[1] ‘1.1.4’
packageVersion('arrow')
[1] ‘16.1.0’
packageVersion('dplyr')
[1] ‘1.1.4’

To sum up, as far as I tried in some project play with large data. Arrow has very good performance to store and extract raw data however it has limited sql function implementation. For duckdb, it is high performance computation library which support sql-92, sql-2003(part), sql-2011(part) which use arrow as low level storage engine.
Use arrow + duckdb, enjoy the high performance and overcome such sql compatible issue.

good luck.

system · July 22, 2024, 6:28am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.