Hi, all!
I am trying to summarise some data within duckdb using dplyr/dbplyr, and I am running into problems with functions not successfully translating into sql.
I understand that not everything is possible (I have read SQL translation • dbplyr), but I am simply asking for sums after filtering for a given condition -- this seems likely translatable to sql via dbplyr, I would think.
The crux of the code/issue is:
my_summary_2 <- con %>%
tbl("duck_df") %>%
summarise(across(
.cols = everything(),
.fns = list(
n = ~ n(),
n_na = ~ sum(is.na(.x)),
mean = ~ mean(.x, na.rm = TRUE),
n_lt_p = ~ sum(.x < p, na.rm = TRUE),
pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
)
))
my_summary_2 %>% collect() # boo!
With errors:
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT
COUNT(*) AS x_n,
SUM((x IS NULL)) AS x_n_na,
AVG(x) AS x_mean,
SUM(x < p) AS x_n_lt_p,
AVG(x < p) * 100.0 AS x_pct_lt_p,
COUNT(*) AS y_n,
SUM((y IS NULL)) AS y_n_na,
AVG(y) AS y_mean,
SUM(y < p) AS y_n_lt_p,
AVG(y < p) * 100.0 AS y_pct_lt_p,
COUNT(*) AS z_n,
SUM((z IS NULL)) AS z_n_na,
AVG(z) AS z_mean,
SUM(z < p) AS z_n_lt_p,
AVG(z < p) * 100.0 AS z_pct_lt_p
FROM duck_df
Error: Binder Error: No function matches the given name and argument types 'sum(BOOLEAN)'. You might need to add explicit type casts.
Candidate functions:
sum(DECIMAL) -> DECIMAL
sum(SMALLINT) -> HUGEINT
sum(INTEGER) -> HUGEINT
sum(BIGINT) -> HUGEINT
sum(HUGEINT) -> HUGEINT
sum(DOUBLE) -> DOUBLE
LINE 3: SUM((x IS NULL)) AS x_n_na,
I know this is possible in SQL somehow (COUNT case when...) but I am not veyr comfortable in SQL and would rather stay in dplyr-land if possible. Is there some way I need to rewrite my summary function that will enable appropriate translation?
Thank you!
Full code:
library(dplyr)
library(duckdb)
library(dbplyr)
possible_numbers <- na_if(c(1:4), 4)
df <- data.frame(
x = sample(possible_numbers, size = 10, replace = TRUE),
y = sample(possible_numbers, size = 10, replace = TRUE),
z = sample(possible_numbers, size = 10, replace = TRUE)
)
p <- 2
df %>%
summarise(across(
.cols = everything(),
.fns = list(
n = ~ n(),
n_na = ~ sum(is.na(.x)),
mean = ~ mean(.x, na.rm = TRUE),
n_lt_p = ~ sum(.x < p, na.rm = TRUE),
pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
)
))
con = dbConnect(duckdb(),
dbdir = "test.duckdb",
read_only = FALSE)
# duckdb_register(con, name = "duck_df", df = df)
dbWriteTable(con, "duck_df", df)
my_summary_1 <- con %>%
tbl("duck_df") %>%
summarise(across(.cols = everything(),
.fns = list(mean = ~ mean(.x, na.rm = TRUE))))
my_summary_1 %>% collect() # yay, works!
my_summary_2 <- con %>%
tbl("duck_df") %>%
summarise(across(
.cols = everything(),
.fns = list(
n = ~ n(),
n_na = ~ sum(is.na(.x)),
mean = ~ mean(.x, na.rm = TRUE),
n_lt_p = ~ sum(.x < p, na.rm = TRUE),
pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
)
))
my_summary_2 %>% collect() # boo, errors!