The arrow package has a function that solves this: use to_duckdb() instead of as_arrow_table(). In your code, change just that one call and everything will work.
#------------ Arrow + DuckDB -----------------
> library(dplyr)
> library(arrow)   # to_duckdb() also needs the duckdb package installed
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
+ as.data.frame()
Participant Rating
1 Greg 21
2 Donna 17
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
+ as.data.frame()
Participant Rating
1 Donna 17
2 Greg 21
Warning message:
The `...` argument of `across()` is deprecated as of dbplyr 2.3.0.
ℹ Supply arguments directly to `.fns` through a lambda instead.
# Previously across(a:b, mean, na.rm = TRUE)
# Now across(a:b, ~mean(.x, na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
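As the warning explains, passing na.rm = TRUE through the ... argument of across() has been deprecated since dbplyr 2.3.0; supply it through a lambda instead. Either of these spellings is equivalent and warning-free:

summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE)))  # base R lambda, as above
summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE)))    # purrr-style formula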
My dplyr and arrow versions, for reference:
packageVersion('dplyr')
[1] ‘1.1.4’
packageVersion('arrow')
[1] ‘16.1.0’
To sum up, from what I have tried in projects with large data: Arrow has very good performance for storing and extracting raw data, but it implements only a limited set of SQL-style functions. DuckDB is a high-performance computation engine that supports SQL-92 and parts of SQL:2003 and SQL:2011, and through to_duckdb() it can query Arrow data in place, effectively using Arrow as its storage layer.
Use arrow + duckdb together: you get the performance of both and work around these SQL-compatibility gaps.
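For instance, the same idea scales to on-disk data. Here is a minimal sketch, where data/ratings and data/ratings_max are hypothetical paths: arrow scans the Parquet files lazily, to_duckdb() hands the scan to DuckDB for the aggregation, and to_arrow() streams the result back so arrow can write it out.

#------------ Arrow + DuckDB on a Parquet dataset -----------------
library(arrow)
library(dplyr)

# "data/ratings" is a hypothetical directory of Parquet files
open_dataset("data/ratings") |>            # arrow: lazy scan, nothing read yet
  to_duckdb() |>                           # register the scan with DuckDB, no copy
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  to_arrow() |>                            # stream the result back as an Arrow reader
  write_dataset("data/ratings_max")        # arrow: write the aggregate as Parquet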
Good luck.