The arrow package has a function that solves this: use to_duckdb() instead of as_arrow_table(). In your code, change just that one call and everything will work.
#------------ Arrow + DuckDB -----------------
> library(dplyr)
> library(arrow)   # to_duckdb() also needs the duckdb package installed
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
+ as.data.frame()
Participant Rating
1 Greg 21
2 Donna 17
> data.frame(
+ Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
+ Rating = c(21, NA, 17, NA)
+ ) |>
+ to_duckdb() |>
+ group_by(Participant) |>
+ summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
+ as.data.frame()
Participant Rating
1 Donna 17
2 Greg 21
Warning message:
The `...` argument of `across()` is deprecated as of dbplyr 2.3.0.
ℹ Supply arguments directly to `.fns` through a lambda instead.
# Previously across(a:b, mean, na.rm = TRUE)
# Now across(a:b, ~mean(.x, na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
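As the warning explains, passing na.rm = TRUE through the ... argument of across() has been deprecated since dbplyr 2.3.0; supply it through a lambda instead. Either of these spellings is equivalent and warning-free:

summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE)))  # base R lambda, as above
summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE)))    # purrr-style formula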
My dplyr and arrow versions, for reference:
packageVersion('dplyr')
[1] ‘1.1.4’
packageVersion('arrow')
[1] ‘16.1.0’
To sum up, from what I have tried in projects with large data: Arrow has very good performance for storing and extracting raw data, but it implements only a limited set of SQL-style functions. DuckDB is a high-performance computation engine that supports SQL-92 and parts of SQL:2003 and SQL:2011, and through to_duckdb() it can query Arrow data in place, effectively using Arrow as its storage layer.
Use arrow + duckdb together: you get the performance of both and work around these SQL-compatibility gaps.
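For instance, the same idea scales to on-disk data. Here is a minimal sketch, where data/ratings and data/ratings_max are hypothetical paths: arrow scans the Parquet files lazily, to_duckdb() hands the scan to DuckDB for the aggregation, and to_arrow() streams the result back so arrow can write it out.

#------------ Arrow + DuckDB on a Parquet dataset -----------------
library(arrow)
library(dplyr)

# "data/ratings" is a hypothetical directory of Parquet files
open_dataset("data/ratings") |>            # arrow: lazy scan, nothing read yet
  to_duckdb() |>                           # register the scan with DuckDB, no copy
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  to_arrow() |>                            # stream the result back as an Arrow reader
  write_dataset("data/ratings_max")        # arrow: write the aggregate as Parquet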
Good luck.