[duckplyr/duckdb] Bad SQL call to dplyr with `rowwise()/group_by()` and `across()`

Hi Posit users,

I'm having trouble using rowwise()/group_by() together with an across() call when working through duckdb/dbplyr.

Specifically, given data like the following:

df <- structure(list(ID = c("000000", "000000", "000001", "000002", 
"000002", "000002", "000003", "000004", "000004", "000004"), 
    date = structure(c(19039, 10855, 11458, 16996, 11592, 12133, 
    11678, 12560, 17016, 18938), class = "Date"), Var_00 = c(0, 
    0, 0, 0, 1, 1, 1, 1, 1, 0), Var_01 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Var_03 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_04 = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), Var_05 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_06 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_07 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_09 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

I'd like to produce output like this:

dput(dfout)
structure(list(ID = c("000000", "000000", "000001", "000002", 
"000002", "000002", "000003", "000004", "000004", "000004"), 
    date = structure(c(19039, 10855, 11458, 16996, 11592, 12133, 
    11678, 12560, 17016, 18938), class = "Date"), Var_00 = c(0, 
    0, 0, 0, 1, 1, 1, 1, 1, 0), Var_01 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Var_03 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_04 = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), Var_05 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_06 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_07 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_09 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), sum = c(0, 0, 0, 0, 5, 5, 5, 
    5, 5, 0)), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(.rows = structure(list(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))

Here the sum column is the row-wise total across all the Var_ columns. I'd like to do this with duckdb and duckplyr, since the actual data is on the order of a few million rows.

In standard dplyr, I would use a call like one of the following:

out <- df |> rowwise() |> mutate(sum = sum(across(all_of(Var_list))))

out <- df |> rowwise() |> mutate(sum = sum(across(starts_with("Var"))))

out <- df |> group_by(ID,date) |> mutate(sum = sum(across(all_of(Var_list))))

out <- df |> group_by(ID,date) |> mutate(sum = sum(across(starts_with("Var"))))

However, rowwise() doesn't seem to be implemented for duckdb-backed tables:

out <- df |> to_duckdb() |> rowwise() |> mutate(sum = sum(across(all_of(Var_list))))
Error in UseMethod("rowwise") : 
  no applicable method for 'rowwise' applied to an object of class "c('tbl_duckdb_connection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

out <- df |> to_duckdb() |> rowwise() |> mutate(sum = sum(across(starts_with("Var"))))
Error in UseMethod("rowwise") : 
  no applicable method for 'rowwise' applied to an object of class "c('tbl_duckdb_connection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

And using group_by() after the data is converted to a duckdb table throws a SQL translation error whose suggested fix I haven't quite understood:

> out <- df |> to_duckdb() |> group_by(ID,date) |> mutate(sum = sum(across(all_of(Var_list))))
> out
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
---
Backtrace:
     ▆
  1. ├─base (local) `<fn>`(x)
  2. ├─dbplyr:::print.tbl_sql(x)
  3. │ ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
  4. │ │ ├─base::cat(paste0(..., "\n"), sep = "")
  5. │ │ └─base::paste0(..., "\n")
  6. │ ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
  7. │ └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8. │   └─pillar:::format_tbl(...)
  9. │     └─pillar::tbl_format_setup(...)
 10. │       ├─pillar:::tbl_format_setup_dispatch(...)
 11. │       └─pillar:::tbl_format_setup.tbl(...)
 12. │         └─pillar:::df_head(x, n + 1)
 13. │           ├─base::as.data.frame(head(x, n))
 14. │           └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
 15. │             ├─base::as.data.frame(collect(x, n = n))
 16. │             ├─dplyr::collect(x, n = n)
 17. │             └─dbplyr:::collect.tbl_sql(x, n = n)
 18. │               └─dbplyr::db_sql_render(x$src$con, x, cte = cte)
 19. │                 ├─dbplyr::db_sql_render(con, sql, ..., sql_options = sql_options)
 20. │                 └─dbplyr:::db_sql_render.DBIConnection(con, sql, ..., sql_options = sql_options)
 21. │                   ├─dbplyr::sql_render(sql, con = con, ..., sql_options = sql_options)
 22. │                   └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., sql_options = sql_options)
 23. │                     ├─dbplyr::sql_render(...)
 24. │                     └─dbplyr:::sql_render.lazy_query(...)
 25. │                       ├─dbplyr::sql_build(query, con = con, sql_options = sql_options)
 26. │                       └─dbplyr:::sql_build.lazy_select_query(query, con = con, sql_options = sql_options)
 27. │                         └─dbplyr:::get_select_sql(...)
 28. │                           └─dbplyr:::translate_select_sql(con, select)
 29. │                             └─dbplyr::translate_sql_(...)
 30. │                               └─base::lapply(...)
 31. │                                 └─dbplyr (local) FUN(X[[i]], ...)
 32. │                                   ├─dbplyr::escape(eval_tidy(x, mask), con = con)
 33. │                                   └─rlang::eval_tidy(x, mask)
 34. ├─dbplyr (local) sum(`<named list>`)
 35. │ ├─dbplyr::win_over(...)
 36. │ │ └─dbplyr:::glue_sql2(con, "{.val expr} OVER {.val over}")
 37. │ │   ├─dbplyr::sql(...)
 38. │ │   │ └─dbplyr:::c_character(...)
 39. │ │   └─glue::glue(...)
 40. │ │     └─glue::glue_data(...)
 41. │ └─dbplyr:::glue_sql2(sql_current_con(), "{f}({.val x})")
 42. │   ├─dbplyr::sql(...)
 43. │   │ └─dbplyr:::c_character(...)
 44. │   └─glue::glue(...)
 45. │     └─glue::glue_data(...)
 46. ├─glue (local) `<fn>`(".val expr")
 47. │ ├─.transformer(expr, env) %||% .null
 48. │ └─dbplyr (local) .transformer(expr, env)
 49. │   └─base::eval(parse(text = value, keep.source = FALSE), envir)
 50. │     └─base::eval(parse(text = value, keep.source = FALSE), envir)
 51. └─glue (local) `<fn>`(".val x")
 52.   ├─.transformer(expr, env) %||% .null
 53.   └─dbplyr (local) .transformer(expr, env)
 54.     ├─dbplyr::escape(value, con = connection)
 55.     └─dbplyr:::escape.list(value, con = connection)
 56.       └─base::vapply(x, escape, character(1), con = con)
 57.         ├─dbplyr (local) FUN(X[[i]], ...)
 58.         └─dbplyr:::escape.default(X[[i]], ...)
 59.           └─dbplyr:::error_embed(obj_type_friendly(x), "x")
Run rlang::last_trace(drop = FALSE) to see 2 hidden frames.

(see below for the hidden-frame output of the similar starts_with() command)

> out <- df |> to_duckdb() |> group_by(ID,date) |> mutate(sum = sum(across(starts_with("Var"))))
> out
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
---
Backtrace:
     ▆
  1. ├─base (local) `<fn>`(x)
  2. ├─dbplyr:::print.tbl_sql(x)
  3. │ ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
  4. │ │ ├─base::cat(paste0(..., "\n"), sep = "")
  5. │ │ └─base::paste0(..., "\n")
  6. │ ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
  7. │ └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8. │   └─pillar:::format_tbl(...)
  9. │     └─pillar::tbl_format_setup(...)
 10. │       ├─pillar:::tbl_format_setup_dispatch(...)
 11. │       └─pillar:::tbl_format_setup.tbl(...)
 12. │         └─pillar:::df_head(x, n + 1)
 13. │           ├─base::as.data.frame(head(x, n))
 14. │           └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
 15. │             ├─base::as.data.frame(collect(x, n = n))
 16. │             ├─dplyr::collect(x, n = n)
 17. │             └─dbplyr:::collect.tbl_sql(x, n = n)
 18. │               └─dbplyr::db_sql_render(x$src$con, x, cte = cte)
 19. │                 ├─dbplyr::db_sql_render(con, sql, ..., sql_options = sql_options)
 20. │                 └─dbplyr:::db_sql_render.DBIConnection(con, sql, ..., sql_options = sql_options)
 21. │                   ├─dbplyr::sql_render(sql, con = con, ..., sql_options = sql_options)
 22. │                   └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., sql_options = sql_options)
 23. │                     ├─dbplyr::sql_render(...)
 24. │                     └─dbplyr:::sql_render.lazy_query(...)
 25. │                       ├─dbplyr::sql_build(query, con = con, sql_options = sql_options)
 26. │                       └─dbplyr:::sql_build.lazy_select_query(query, con = con, sql_options = sql_options)
 27. │                         └─dbplyr:::get_select_sql(...)
 28. │                           └─dbplyr:::translate_select_sql(con, select)
 29. │                             └─dbplyr::translate_sql_(...)
 30. │                               └─base::lapply(...)
 31. │                                 └─dbplyr (local) FUN(X[[i]], ...)
 32. │                                   ├─dbplyr::escape(eval_tidy(x, mask), con = con)
 33. │                                   └─rlang::eval_tidy(x, mask)
 34. ├─dbplyr (local) sum(`<named list>`)
 35. │ ├─dbplyr::win_over(...)
 36. │ │ └─dbplyr:::glue_sql2(con, "{.val expr} OVER {.val over}")
 37. │ │   ├─dbplyr::sql(...)
 38. │ │   │ └─dbplyr:::c_character(...)
 39. │ │   └─glue::glue(...)
 40. │ │     └─glue::glue_data(...)
 41. │ └─dbplyr:::glue_sql2(sql_current_con(), "{f}({.val x})")
 42. │   ├─dbplyr::sql(...)
 43. │   │ └─dbplyr:::c_character(...)
 44. │   └─glue::glue(...)
 45. │     └─glue::glue_data(...)
 46. ├─glue (local) `<fn>`(".val expr")
 47. │ ├─.transformer(expr, env) %||% .null
 48. │ └─dbplyr (local) .transformer(expr, env)
 49. │   └─base::eval(parse(text = value, keep.source = FALSE), envir)
 50. │     └─base::eval(parse(text = value, keep.source = FALSE), envir)
 51. └─glue (local) `<fn>`(".val x")
 52.   ├─.transformer(expr, env) %||% .null
 53.   └─dbplyr (local) .transformer(expr, env)
 54.     ├─dbplyr::escape(value, con = connection)
 55.     └─dbplyr:::escape.list(value, con = connection)
 56.       └─base::vapply(x, escape, character(1), con = con)
 57.         ├─dbplyr (local) FUN(X[[i]], ...)
 58.         └─dbplyr:::escape.default(X[[i]], ...)
 59.           └─dbplyr:::error_embed(obj_type_friendly(x), "x")
 60.             └─cli::cli_abort(...)
 61.               └─rlang::abort(...)

I'm assuming some R-to-SQL translation is getting lost on the backend and could be fixed with a direct call of some kind, but it isn't intuitive to me how to "force evaluation in R with (e.g.) `!!x` or `local(x)`", as the error suggests.

Would you happen to have any suggestions or an example of how I could do this or work around it?

Thank you in advance!

Edit:
I forgot to list the loaded packages. Apologies!

microbenchmark_1.4.10
profmem_0.6.0
fastDummies_1.7.3
glue_1.7.0
data.table_1.15.4
janitor_2.2.0
readxl_1.4.3
reticulate_1.38.0
parquetize_0.5.7
duckplyr_0.4.1
duckdb_1.0.0-2
DBI_1.2.3
arrow_16.1.0
haven_2.5.4
fs_1.6.4
lubridate_1.9.3
forcats_1.0.0
stringr_1.5.1
dplyr_1.1.4
purrr_1.0.2
readr_2.1.5
tidyr_1.3.1
tibble_3.2.1
ggplot2_3.5.1
tidyverse_2.0.0

I wasn't sure which packages were being used here; to_duckdb() comes from the arrow package? In any case, my solution focuses on using dbplyr with duckdb, as I haven't yet worked with duckplyr. I also changed your data slightly to make sure it was working as intended.

First, I don't think you need any grouping for what you want to accomplish; if your actual use case differs that much, grouping should be easy to add back. In-memory (no database), you can use rowSums() or Reduce() to get what you need. Taking the Reduce() idea further, we can build an equivalent SQL expression: Reduce() itself can't be translated to SQL by dbplyr, but we can generate a "Var_00 + Var_01 + ..." string in R and plug it into the sql() function to hopefully get what you need.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Data
df <- structure(
  list(
    ID = c(
      "000000",
      "000000",
      "000001",
      "000002",
      "000002",
      "000002",
      "000003",
      "000004",
      "000004",
      "000004"
    ),
    date = structure(
      c(
        19039,
        10855,
        11458,
        16996,
        11592,
        12133,
        11678,
        12560,
        17016,
        18938
      ),
      class = "Date"
    ),
    Var_00 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_01 = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0),
    Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_03 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
    Var_04 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_05 = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 0),
    Var_06 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_07 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_09 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

Var_list = c(
  "Var_00",
  "Var_01",
  "Var_02",
  "Var_03",
  "Var_04",
  "Var_05",
  "Var_06",
  "Var_07",
  "Var_08",
  "Var_09"
)

# Desired output
out = df |> 
  rowwise() |> 
  mutate(sum = sum(across(all_of(Var_list)))) |> 
  ungroup()
print(out)
#> # A tibble: 10 × 13
#>    ID     date       Var_00 Var_01 Var_02 Var_03 Var_04 Var_05 Var_06 Var_07
#>    <chr>  <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 000000 2022-02-16      0      0      0      0      0      0      0      0
#>  2 000000 1999-09-21      0      0      0      0      0      0      0      0
#>  3 000001 2001-05-16      0      0      0      1      0      1      0      0
#>  4 000002 2016-07-14      0      0      0      0      0      0      0      0
#>  5 000002 2001-09-27      1      1      0      0      0      1      0      1
#>  6 000002 2003-03-22      1      1      0      0      0      1      0      1
#>  7 000003 2001-12-22      1      0      0      0      0      0      0      1
#>  8 000004 2004-05-22      1      1      0      0      0      1      0      1
#>  9 000004 2016-08-03      1      1      0      0      0      1      0      1
#> 10 000004 2021-11-07      0      0      0      0      0      0      0      0
#> # ℹ 3 more variables: Var_08 <dbl>, Var_09 <dbl>, sum <dbl>

# But no grouping is needed
## rowSums
out_nogroup = df |> 
  mutate(sum = rowSums(across(all_of(Var_list))))

## Reduce
out_reduce = df |> 
  mutate(sum = Reduce( `+`, across(all_of(Var_list))))

waldo::compare(out, out_nogroup) # same
#> ✔ No differences
waldo::compare(out, out_reduce) # same
#> ✔ No differences


# Use duckdb and dbplyr
con = dbConnect(duckdb())
dbWriteTable(con, name = "df", df)

Var_list_expr = paste0(Var_list, collapse = " + ")
print(Var_list_expr)
#> [1] "Var_00 + Var_01 + Var_02 + Var_03 + Var_04 + Var_05 + Var_06 + Var_07 + Var_08 + Var_09"

out_duck = tbl(con, "df") |> 
  mutate(sum = sql(Var_list_expr))
class(out_duck)
#> [1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
#> [4] "tbl_lazy"              "tbl"
out_duck |> explain() # plan
#> <SQL>
#> SELECT
#>   df.*,
#>   Var_00 + Var_01 + Var_02 + Var_03 + Var_04 + Var_05 + Var_06 + Var_07 + Var_08 + Var_09 AS sum
#> FROM df
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             ID            │
#> │            date           │
#> │           Var_00          │
#> │           Var_01          │
#> │           Var_02          │
#> │           Var_03          │
#> │           Var_04          │
#> │           Var_05          │
#> │           Var_06          │
#> │           Var_07          │
#> │           Var_08          │
#> │           Var_09          │
#> │            sum            │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         SEQ_SCAN          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             df            │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             ID            │
#> │            date           │
#> │           Var_00          │
#> │           Var_01          │
#> │           Var_02          │
#> │           Var_03          │
#> │           Var_04          │
#> │           Var_05          │
#> │           Var_06          │
#> │           Var_07          │
#> │           Var_08          │
#> │           Var_09          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │           EC: 10          │
#> └───────────────────────────┘

waldo::compare(out, out_duck |> collect())
#> ✔ No differences

Created on 2024-09-10 with reprex v2.1.1
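A closely related variant (a sketch I haven't benchmarked, reusing the same `con`, `df` table, and `Var_list` as above): instead of passing a SQL string to sql(), build the sum as an unevaluated R expression and splice it into mutate() with `!!`, which is what the "force evaluation in R" hint in the error message is pointing at.

```r
library(dplyr)
library(rlang)

# Build `Var_00 + Var_01 + ...` as an R expression, then splice it in
# with !! so dbplyr only has to translate plain addition instead of the
# untranslatable sum(across(...)).
sum_expr <- parse_expr(paste(Var_list, collapse = " + "))

out_duck2 <- tbl(con, "df") |>
  mutate(sum = !!sum_expr)
```

This produces the same SQL as the sql() approach, but keeps the expression as dbplyr-translated R code rather than a raw SQL string.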


In theory you could instead use pivot_longer(), summarise the long data, and join the resulting sums back to the main table:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

ddb_tbl <- arrow::to_duckdb(df)
sums_tbl <- 
  ddb_tbl |> 
  pivot_longer(cols = starts_with("Var_")) |> 
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date)) 
sums_tbl
#> # Source:   SQL [10 x 3]
#> # Database: DuckDB v1.0.0 [m@Windows 10 x64:R 4.4.1/:memory:]
#>    ID     date         sum
#>    <chr>  <date>     <dbl>
#>  1 000002 2003-03-22     5
#>  2 000004 2016-08-03     5
#>  3 000004 2004-05-22     5
#>  4 000001 2001-05-16     0
#>  5 000002 2016-07-14     0
#>  6 000003 2001-12-22     5
#>  7 000000 2022-02-16     0
#>  8 000000 1999-09-21     0
#>  9 000002 2001-09-27     5
#> 10 000004 2021-11-07     0

result <- left_join(ddb_tbl, sums_tbl, by = join_by(ID, date))

Results with:

result
#> # Source:   SQL [10 x 13]
#> # Database: DuckDB v1.0.0 [m@Windows 10 x64:R 4.4.1/:memory:]
#>    ID     date       Var_00 Var_01 Var_02 Var_03 Var_04 Var_05 Var_06 Var_07 Var_08 Var_09   sum
#>    <chr>  <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
#>  1 000000 2022-02-16      0      0      0      0      0      0      0      0      0      0     0
#>  2 000000 1999-09-21      0      0      0      0      0      0      0      0      0      0     0
#>  3 000001 2001-05-16      0      0      0      0      0      0      0      0      0      0     0
#>  4 000002 2016-07-14      0      0      0      0      0      0      0      0      0      0     0
#>  5 000002 2001-09-27      1      1      0      0      0      1      0      1      1      0     5
#>  6 000002 2003-03-22      1      1      0      0      0      1      0      1      1      0     5
#>  7 000003 2001-12-22      1      1      0      0      0      1      0      1      1      0     5
#>  8 000004 2004-05-22      1      1      0      0      0      1      0      1      1      0     5
#>  9 000004 2016-08-03      1      1      0      0      0      1      0      1      1      0     5
#> 10 000004 2021-11-07      0      0      0      0      0      0      0      0      0      0     0

Though from explain(result) I'm not really sure if or how this scales to your actual dataset. If performance isn't acceptable, perhaps consider building the sums_tbl subquery dynamically with UNPIVOT and a COLUMNS expression.
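For reference, a raw-SQL sketch of that UNPIVOT idea (untested; it assumes the data is registered as a duckdb table named `df` on a DBI connection `con`, as in the earlier dbWriteTable() example, and that the `'Var_.*'` regex matches your column naming):

```r
library(DBI)

# DuckDB's UNPIVOT with a COLUMNS() regex stacks all Var_* columns
# without listing them by name; we then aggregate per (ID, date).
sums <- dbGetQuery(con, "
  SELECT ID, date, SUM(value) AS sum
  FROM (UNPIVOT df ON COLUMNS('Var_.*') INTO NAME name VALUE value)
  GROUP BY ID, date
")
```

The advantage over pivot_longer() is that the column set is resolved by DuckDB itself, so nothing needs to be enumerated in R.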

Thanks @margusl!

This works, but as a quick heads up, if there are any NA values in the data, the sum column returns NA for all values.

As such, I would suggest the following edit, especially if using Var_list:

ddb_tbl <- df |>
  mutate(across(all_of(Var_list), ~ tidyr::replace_na(.x, 0))) |>
  arrow::to_duckdb()
sums_tbl <-
  ddb_tbl |>
  pivot_longer(cols = all_of(Var_list)) |>
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date))
sums_tbl

Which seems to resolve the issue.

Interesting, I'm not able to reproduce this, even after filling one Var column and one row with NAs:

df$Var_00 <- NA
df[9,3:12] <- NA

Anyway, if NA/NULL replacement is needed, I'd rather let DuckDB handle it, after pivoting:

sums_tbl <- 
  ddb_tbl |> 
  pivot_longer(cols = starts_with("Var_")) |> 
  mutate(value = coalesce(value, 0L)) |> 
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date))
