[Arrow with tidyverse] Calling min/max/mean with summarize on arrow tables

Hi Posit Community.

I reached out to the arrow devs, but have not received a response regarding this request.

I have an arrow table, and I want to run some basic functions such as mean , max , or min across multiple repeating participants using summarize, but it appears that arrow does not currently accept the na.rm = TRUE argument (unless arrow_min does and I am missing something), or that if it does, I can't seem to find it in the documentation.

Say I took the original dataset:

Participant Rating
Donna 17
Donna NA
Greg 21
Greg NA

If these were generic R dataframes, either of these two calls would work (though one is deprecated):

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()
Participant Rating
Donna 17
Greg 21

However, when I run the same commands as an arrow table, both throw errors:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

And the one that does work:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()

Returns NA values that are not what I want:

Participant Rating
Donna NA
Greg NA

Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?

arrow_max and the similar commands also do not appear to work with summarize.

Thank you to all in advance.

As a temporary hack, you could try defining your own my_max() function and call that within your original code in place of your anonymous function.

Just saw this:


here:

which may be the issue.

1 Like

Yeah I figured this was 90% of the issue, but I was hoping there was something here:

or here

That I'm just missing, as the latter seems to imply this should work.

In the actual workflow, I'm selecting my columns with across and matches, but willing to forgo that altogether just to get this na.rm argument working.

Does you code work if you replace the across() column selection with a single column of interest?

Maybe it can be done by combining the arrow drop_null functionality with the sum they provide.

dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17
2 Likes

Very interesting, like it's relying on an older version of R. Might be worth a check with the developers.

This might be wise. I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.

With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?

1 Like

What @dromano said.

Interesting this method works.

Saves me many hours of calculations as well!

Thanks so much!

I'll pass this on to the Arrow devs in that same thread.

1 Like

Glad to see you got a solution here. For a bit of extra context, the problem likely lies in how the arrow code is converting the across() call and perhaps missing something there. Another alternative workaround (but this is inelegant IMO) is:

max_na <- function(x) max(x, na.rm = TRUE)

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max_na)) |>
  as.data.frame()

I implemented most of the code in the arrow version of across() and will take a look at fixing this at some point, though may be a while as life is pretty busy right now! Thanks everyone who jumped in here with ideas and a solution :slight_smile:

1 Like

arrow package has a function to solve this. use to_duckdb instead of as_arrow_table.

For you code, just change this part, all thinks will be ok.


#------------Arrow duck -----------------
> data.frame(
+   Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+   Rating = c(21, NA, 17, NA)
+ ) |>
+   to_duckdb() |>
+   group_by(Participant) |>
+   summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
+   as.data.frame()
  Participant Rating
1        Greg     21
2       Donna     17
> data.frame(
+   Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+   Rating = c(21, NA, 17, NA)
+ ) |>
+   to_duckdb() |>
+   group_by(Participant) |>
+   summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
+   as.data.frame()
  Participant Rating
1       Donna     17
2        Greg     21
警告信息:
The `...` argument of `across()` is deprecated as of dbplyr 2.3.0.
ℹ Supply arguments directly to `.fns` through a lambda instead.

# Previously across(a:b, mean, na.rm = TRUE)

# Now across(a:b, ~mean(.x, na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated

my arrow, duckdb and dplyr version for reference:

packageVersion('dplyr')
[1] ‘1.1.4’
packageVersion('arrow')
[1] ‘16.1.0’
packageVersion('dplyr')
[1] ‘1.1.4’

To sum up, as far as I tried in some project play with large data. Arrow has very good performance to store and extract raw data however it has limited sql function implementation. For duckdb, it is high performance computation library which support sql-92, sql-2003(part), sql-2011(part) which use arrow as low level storage engine.
Use arrow + duckdb, enjoy the high performance and overcome such sql compatible issue.

good luck.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.