[duckplyr/duckdb] Bad SQL call to dplyr with `rowwise/groupby()` and `across()`

Hi Posit users.

I'm having trouble using rowwise()/group_by() together with an across() call when working through duckdb/dbplyr.

Specifically, given data like the following:

df <- structure(list(ID = c("000000", "000000", "000001", "000002", 
"000002", "000002", "000003", "000004", "000004", "000004"), 
    date = structure(c(19039, 10855, 11458, 16996, 11592, 12133, 
    11678, 12560, 17016, 18938), class = "Date"), Var_00 = c(0, 
    0, 0, 0, 1, 1, 1, 1, 1, 0), Var_01 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Var_03 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_04 = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), Var_05 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_06 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_07 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_09 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

I'd like to produce output like this:

dput(dfout)
structure(list(ID = c("000000", "000000", "000001", "000002", 
"000002", "000002", "000003", "000004", "000004", "000004"), 
    date = structure(c(19039, 10855, 11458, 16996, 11592, 12133, 
    11678, 12560, 17016, 18938), class = "Date"), Var_00 = c(0, 
    0, 0, 0, 1, 1, 1, 1, 1, 0), Var_01 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Var_03 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_04 = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), Var_05 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_06 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Var_07 = c(0, 0, 0, 0, 1, 1, 
    1, 1, 1, 0), Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), Var_09 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), sum = c(0, 0, 0, 0, 5, 5, 5, 
    5, 5, 0)), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(.rows = structure(list(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))

where the sum column is the row-wise total across all the Var_ columns. I'd like to do this using duckdb and duckplyr, as the actual data is on the order of a few million rows.

In standard dplyr, I would use a call like one of the following:

out <- df |> rowwise() |> mutate(sum = sum(across(all_of(Var_list))))

out <- df |> rowwise() |> mutate(sum = sum(across(starts_with("Var"))))

out <- df |> group_by(ID,date) |> mutate(sum = sum(across(all_of(Var_list))))

out <- df |> group_by(ID,date) |> mutate(sum = sum(across(starts_with("Var"))))

However, rowwise() doesn't seem to be implemented for these lazy duckdb tables:

out <- df |> to_duckdb() |> rowwise() |> mutate(sum = sum(across(all_of(Var_list))))
Error in UseMethod("rowwise") : 
  no applicable method for 'rowwise' applied to an object of class "c('tbl_duckdb_connection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

out <- df |> to_duckdb() |> rowwise() |> mutate(sum = sum(across(starts_with("Var"))))
Error in UseMethod("rowwise") : 
  no applicable method for 'rowwise' applied to an object of class "c('tbl_duckdb_connection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

And using group_by() once the data is converted to a duckdb table throws a SQL translation error whose suggested fix I don't quite understand:

> out <- df |> to_duckdb() |> group_by(ID,date) |> mutate(sum = sum(across(all_of(Var_list))))
> out
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
---
Backtrace:
     ▆
  1. ├─base (local) `<fn>`(x)
  2. ├─dbplyr:::print.tbl_sql(x)
  3. │ ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
  4. │ │ ├─base::cat(paste0(..., "\n"), sep = "")
  5. │ │ └─base::paste0(..., "\n")
  6. │ ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
  7. │ └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8. │   └─pillar:::format_tbl(...)
  9. │     └─pillar::tbl_format_setup(...)
 10. │       ├─pillar:::tbl_format_setup_dispatch(...)
 11. │       └─pillar:::tbl_format_setup.tbl(...)
 12. │         └─pillar:::df_head(x, n + 1)
 13. │           ├─base::as.data.frame(head(x, n))
 14. │           └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
 15. │             ├─base::as.data.frame(collect(x, n = n))
 16. │             ├─dplyr::collect(x, n = n)
 17. │             └─dbplyr:::collect.tbl_sql(x, n = n)
 18. │               └─dbplyr::db_sql_render(x$src$con, x, cte = cte)
 19. │                 ├─dbplyr::db_sql_render(con, sql, ..., sql_options = sql_options)
 20. │                 └─dbplyr:::db_sql_render.DBIConnection(con, sql, ..., sql_options = sql_options)
 21. │                   ├─dbplyr::sql_render(sql, con = con, ..., sql_options = sql_options)
 22. │                   └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., sql_options = sql_options)
 23. │                     ├─dbplyr::sql_render(...)
 24. │                     └─dbplyr:::sql_render.lazy_query(...)
 25. │                       ├─dbplyr::sql_build(query, con = con, sql_options = sql_options)
 26. │                       └─dbplyr:::sql_build.lazy_select_query(query, con = con, sql_options = sql_options)
 27. │                         └─dbplyr:::get_select_sql(...)
 28. │                           └─dbplyr:::translate_select_sql(con, select)
 29. │                             └─dbplyr::translate_sql_(...)
 30. │                               └─base::lapply(...)
 31. │                                 └─dbplyr (local) FUN(X[[i]], ...)
 32. │                                   ├─dbplyr::escape(eval_tidy(x, mask), con = con)
 33. │                                   └─rlang::eval_tidy(x, mask)
 34. ├─dbplyr (local) sum(`<named list>`)
 35. │ ├─dbplyr::win_over(...)
 36. │ │ └─dbplyr:::glue_sql2(con, "{.val expr} OVER {.val over}")
 37. │ │   ├─dbplyr::sql(...)
 38. │ │   │ └─dbplyr:::c_character(...)
 39. │ │   └─glue::glue(...)
 40. │ │     └─glue::glue_data(...)
 41. │ └─dbplyr:::glue_sql2(sql_current_con(), "{f}({.val x})")
 42. │   ├─dbplyr::sql(...)
 43. │   │ └─dbplyr:::c_character(...)
 44. │   └─glue::glue(...)
 45. │     └─glue::glue_data(...)
 46. ├─glue (local) `<fn>`(".val expr")
 47. │ ├─.transformer(expr, env) %||% .null
 48. │ └─dbplyr (local) .transformer(expr, env)
 49. │   └─base::eval(parse(text = value, keep.source = FALSE), envir)
 50. │     └─base::eval(parse(text = value, keep.source = FALSE), envir)
 51. └─glue (local) `<fn>`(".val x")
 52.   ├─.transformer(expr, env) %||% .null
 53.   └─dbplyr (local) .transformer(expr, env)
 54.     ├─dbplyr::escape(value, con = connection)
 55.     └─dbplyr:::escape.list(value, con = connection)
 56.       └─base::vapply(x, escape, character(1), con = con)
 57.         ├─dbplyr (local) FUN(X[[i]], ...)
 58.         └─dbplyr:::escape.default(X[[i]], ...)
 59.           └─dbplyr:::error_embed(obj_type_friendly(x), "x")
Run rlang::last_trace(drop = FALSE) to see 2 hidden frames.

(see below for hidden frame output of similar command)

> out <- df |> to_duckdb() |> group_by(ID,date) |> mutate(sum = sum(across(starts_with("Var"))))
> out
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
(The backtrace is identical to the one shown above.)
> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error:
! Cannot translate a symbol to SQL.
ℹ Do you want to force evaluation in R with (e.g.) `!!x` or `local(x)`?
---
Backtrace:
     ▆
  1. ├─base (local) `<fn>`(x)
  2. ├─dbplyr:::print.tbl_sql(x)
  3. │ ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
  4. │ │ ├─base::cat(paste0(..., "\n"), sep = "")
  5. │ │ └─base::paste0(..., "\n")
  6. │ ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
  7. │ └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8. │   └─pillar:::format_tbl(...)
  9. │     └─pillar::tbl_format_setup(...)
 10. │       ├─pillar:::tbl_format_setup_dispatch(...)
 11. │       └─pillar:::tbl_format_setup.tbl(...)
 12. │         └─pillar:::df_head(x, n + 1)
 13. │           ├─base::as.data.frame(head(x, n))
 14. │           └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
 15. │             ├─base::as.data.frame(collect(x, n = n))
 16. │             ├─dplyr::collect(x, n = n)
 17. │             └─dbplyr:::collect.tbl_sql(x, n = n)
 18. │               └─dbplyr::db_sql_render(x$src$con, x, cte = cte)
 19. │                 ├─dbplyr::db_sql_render(con, sql, ..., sql_options = sql_options)
 20. │                 └─dbplyr:::db_sql_render.DBIConnection(con, sql, ..., sql_options = sql_options)
 21. │                   ├─dbplyr::sql_render(sql, con = con, ..., sql_options = sql_options)
 22. │                   └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., sql_options = sql_options)
 23. │                     ├─dbplyr::sql_render(...)
 24. │                     └─dbplyr:::sql_render.lazy_query(...)
 25. │                       ├─dbplyr::sql_build(query, con = con, sql_options = sql_options)
 26. │                       └─dbplyr:::sql_build.lazy_select_query(query, con = con, sql_options = sql_options)
 27. │                         └─dbplyr:::get_select_sql(...)
 28. │                           └─dbplyr:::translate_select_sql(con, select)
 29. │                             └─dbplyr::translate_sql_(...)
 30. │                               └─base::lapply(...)
 31. │                                 └─dbplyr (local) FUN(X[[i]], ...)
 32. │                                   ├─dbplyr::escape(eval_tidy(x, mask), con = con)
 33. │                                   └─rlang::eval_tidy(x, mask)
 34. ├─dbplyr (local) sum(`<named list>`)
 35. │ ├─dbplyr::win_over(...)
 36. │ │ └─dbplyr:::glue_sql2(con, "{.val expr} OVER {.val over}")
 37. │ │   ├─dbplyr::sql(...)
 38. │ │   │ └─dbplyr:::c_character(...)
 39. │ │   └─glue::glue(...)
 40. │ │     └─glue::glue_data(...)
 41. │ └─dbplyr:::glue_sql2(sql_current_con(), "{f}({.val x})")
 42. │   ├─dbplyr::sql(...)
 43. │   │ └─dbplyr:::c_character(...)
 44. │   └─glue::glue(...)
 45. │     └─glue::glue_data(...)
 46. ├─glue (local) `<fn>`(".val expr")
 47. │ ├─.transformer(expr, env) %||% .null
 48. │ └─dbplyr (local) .transformer(expr, env)
 49. │   └─base::eval(parse(text = value, keep.source = FALSE), envir)
 50. │     └─base::eval(parse(text = value, keep.source = FALSE), envir)
 51. └─glue (local) `<fn>`(".val x")
 52.   ├─.transformer(expr, env) %||% .null
 53.   └─dbplyr (local) .transformer(expr, env)
 54.     ├─dbplyr::escape(value, con = connection)
 55.     └─dbplyr:::escape.list(value, con = connection)
 56.       └─base::vapply(x, escape, character(1), con = con)
 57.         ├─dbplyr (local) FUN(X[[i]], ...)
 58.         └─dbplyr:::escape.default(X[[i]], ...)
 59.           └─dbplyr:::error_embed(obj_type_friendly(x), "x")
 60.             └─cli::cli_abort(...)
 61.               └─rlang::abort(...)

I'm assuming some R-to-SQL translation is getting lost on the backend, and that it can be fixed with a direct call on one side or the other, but it isn't intuitive to me how to "force evaluation in R with (e.g.) `!!x` or `local(x)`", as the error message suggests.

Would you happen to have any suggestions or an example as to how I could perform or get around this?

Thank you in advance!

Edit:
Forgot to list my loaded packages earlier, apologies:

microbenchmark_1.4.10
profmem_0.6.0
fastDummies_1.7.3
glue_1.7.0
data.table_1.15.4
janitor_2.2.0
readxl_1.4.3
reticulate_1.38.0
parquetize_0.5.7
duckplyr_0.4.1
duckdb_1.0.0-2
DBI_1.2.3
arrow_16.1.0
haven_2.5.4
fs_1.6.4
lubridate_1.9.3
forcats_1.0.0
stringr_1.5.1
dplyr_1.1.4
purrr_1.0.2
readr_2.1.5
tidyr_1.3.1
tibble_3.2.1
ggplot2_3.5.1
tidyverse_2.0.0

I wasn't sure which packages were being used here; to_duckdb() comes from the arrow package? In any case, my solution focuses on using dbplyr with duckdb, as I haven't yet worked with duckplyr. I also changed your data slightly to make sure it was working as intended.

First, I don't think you need any grouping for what you want to accomplish; if your actual example differs that much, it should be easy to include. In-memory (no database), you can use rowSums() or Reduce() to get what you need. Taking the idea of Reduce(), we can then build an equivalent SQL statement: Reduce() itself can't be translated to SQL by dbplyr, but we can generate a "Var_00 + Var_01 + ..." string and plug it into dbplyr's sql() function to get what you need.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Data
df <- structure(
  list(
    ID = c(
      "000000",
      "000000",
      "000001",
      "000002",
      "000002",
      "000002",
      "000003",
      "000004",
      "000004",
      "000004"
    ),
    date = structure(
      c(
        19039,
        10855,
        11458,
        16996,
        11592,
        12133,
        11678,
        12560,
        17016,
        18938
      ),
      class = "Date"
    ),
    Var_00 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_01 = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0),
    Var_02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_03 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
    Var_04 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_05 = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 0),
    Var_06 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Var_07 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_08 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
    Var_09 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  ),
  row.names = c(NA,-10L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

Var_list = c(
  "Var_00",
  "Var_01",
  "Var_02",
  "Var_03",
  "Var_04",
  "Var_05",
  "Var_06",
  "Var_07",
  "Var_08",
  "Var_09"
)

# Desired output
out = df |> 
  rowwise() |> 
  mutate(sum = sum(across(all_of(Var_list)))) |> 
  ungroup()
print(out)
#> # A tibble: 10 × 13
#>    ID     date       Var_00 Var_01 Var_02 Var_03 Var_04 Var_05 Var_06 Var_07
#>    <chr>  <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 000000 2022-02-16      0      0      0      0      0      0      0      0
#>  2 000000 1999-09-21      0      0      0      0      0      0      0      0
#>  3 000001 2001-05-16      0      0      0      1      0      1      0      0
#>  4 000002 2016-07-14      0      0      0      0      0      0      0      0
#>  5 000002 2001-09-27      1      1      0      0      0      1      0      1
#>  6 000002 2003-03-22      1      1      0      0      0      1      0      1
#>  7 000003 2001-12-22      1      0      0      0      0      0      0      1
#>  8 000004 2004-05-22      1      1      0      0      0      1      0      1
#>  9 000004 2016-08-03      1      1      0      0      0      1      0      1
#> 10 000004 2021-11-07      0      0      0      0      0      0      0      0
#> # ℹ 3 more variables: Var_08 <dbl>, Var_09 <dbl>, sum <dbl>

# But no grouping is needed
## rowSums
out_nogroup = df |> 
  mutate(sum = rowSums(across(all_of(Var_list))))

## Reduce
out_reduce = df |> 
  mutate(sum = Reduce( `+`, across(all_of(Var_list))))

waldo::compare(out, out_nogroup) # same
#> ✔ No differences
waldo::compare(out, out_reduce) # same
#> ✔ No differences


# Use duckdb and dbplyr
con = dbConnect(duckdb())
dbWriteTable(con, name = "df", df)

Var_list_expr = paste0(Var_list, collapse = " + ")
print(Var_list_expr)
#> [1] "Var_00 + Var_01 + Var_02 + Var_03 + Var_04 + Var_05 + Var_06 + Var_07 + Var_08 + Var_09"

out_duck = tbl(con, "df") |> 
  mutate(sum = sql(Var_list_expr))
class(out_duck)
#> [1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
#> [4] "tbl_lazy"              "tbl"
out_duck |> explain() # plan
#> <SQL>
#> SELECT
#>   df.*,
#>   Var_00 + Var_01 + Var_02 + Var_03 + Var_04 + Var_05 + Var_06 + Var_07 + Var_08 + Var_09 AS sum
#> FROM df
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             ID            │
#> │            date           │
#> │           Var_00          │
#> │           Var_01          │
#> │           Var_02          │
#> │           Var_03          │
#> │           Var_04          │
#> │           Var_05          │
#> │           Var_06          │
#> │           Var_07          │
#> │           Var_08          │
#> │           Var_09          │
#> │            sum            │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         SEQ_SCAN          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             df            │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             ID            │
#> │            date           │
#> │           Var_00          │
#> │           Var_01          │
#> │           Var_02          │
#> │           Var_03          │
#> │           Var_04          │
#> │           Var_05          │
#> │           Var_06          │
#> │           Var_07          │
#> │           Var_08          │
#> │           Var_09          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │           EC: 10          │
#> └───────────────────────────┘

waldo::compare(out, out_duck |> collect())
#> ✔ No differences

Created on 2024-09-10 with reprex v2.1.1
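As an aside, if you'd rather stay in R expressions than write raw SQL, the "force evaluation with `!!x`" hint from the error message can be taken fairly literally: parse the same string into an R expression and splice it into mutate(). A sketch (untested here, reusing the same `con` and `Var_list` as above):

```r
# Build the "Var_00 + Var_01 + ..." string as an R expression and
# splice it in with !!; dbplyr can translate plain `+` between
# columns to SQL, unlike sum(across(...)).
sum_expr <- rlang::parse_expr(paste(Var_list, collapse = " + "))

out_duck2 <- tbl(con, "df") |>
  mutate(sum = !!sum_expr)
```

This should render the same `Var_00 + Var_01 + ... AS sum` projection as the sql() version.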

In theory you could instead use pivot_longer(), summarise(), and then join the resulting sums back to the main table:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

ddb_tbl <- arrow::to_duckdb(df)
sums_tbl <- 
  ddb_tbl |> 
  pivot_longer(cols = starts_with("Var_")) |> 
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date)) 
sums_tbl
#> # Source:   SQL [10 x 3]
#> # Database: DuckDB v1.0.0 [m@Windows 10 x64:R 4.4.1/:memory:]
#>    ID     date         sum
#>    <chr>  <date>     <dbl>
#>  1 000002 2003-03-22     5
#>  2 000004 2016-08-03     5
#>  3 000004 2004-05-22     5
#>  4 000001 2001-05-16     0
#>  5 000002 2016-07-14     0
#>  6 000003 2001-12-22     5
#>  7 000000 2022-02-16     0
#>  8 000000 1999-09-21     0
#>  9 000002 2001-09-27     5
#> 10 000004 2021-11-07     0

result <- left_join(ddb_tbl, sums_tbl, by = join_by(ID, date))

Results with:

result
#> # Source:   SQL [10 x 13]
#> # Database: DuckDB v1.0.0 [m@Windows 10 x64:R 4.4.1/:memory:]
#>    ID     date       Var_00 Var_01 Var_02 Var_03 Var_04 Var_05 Var_06 Var_07 Var_08 Var_09   sum
#>    <chr>  <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
#>  1 000000 2022-02-16      0      0      0      0      0      0      0      0      0      0     0
#>  2 000000 1999-09-21      0      0      0      0      0      0      0      0      0      0     0
#>  3 000001 2001-05-16      0      0      0      0      0      0      0      0      0      0     0
#>  4 000002 2016-07-14      0      0      0      0      0      0      0      0      0      0     0
#>  5 000002 2001-09-27      1      1      0      0      0      1      0      1      1      0     5
#>  6 000002 2003-03-22      1      1      0      0      0      1      0      1      1      0     5
#>  7 000003 2001-12-22      1      1      0      0      0      1      0      1      1      0     5
#>  8 000004 2004-05-22      1      1      0      0      0      1      0      1      1      0     5
#>  9 000004 2016-08-03      1      1      0      0      0      1      0      1      1      0     5
#> 10 000004 2021-11-07      0      0      0      0      0      0      0      0      0      0     0

Though from explain(result) I'm not really sure if or how this scales to your actual dataset. If performance is not acceptable, perhaps consider building the sums_tbl subquery with DuckDB's UNPIVOT and a dynamic COLUMNS expression.
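For reference, that dynamic approach could look something like the following DuckDB SQL (a hypothetical sketch; the `df` table and `Var_.*` pattern are taken from the example above):

```sql
-- Unpivot every column matching Var_* via a COLUMNS() regex,
-- then sum the stacked values per (ID, date).
SELECT ID, date, SUM(value) AS "sum"
FROM (
    UNPIVOT df
    ON COLUMNS('Var_.*')
    INTO NAME var VALUE value
)
GROUP BY ID, date;
```

The COLUMNS('Var_.*') expression selects the columns to unpivot by regular expression, so the query doesn't need to enumerate them by name.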

Thanks @margusl!

This works, but as a quick heads up, if there are any NA values in the data, the sum column returns NA for all values.

As such, I would suggest the following edit, especially if using Var_list (note that the base pipe needs parentheses on to_duckdb()):

ddb_tbl <- df |>
  mutate(across(all_of(Var_list), ~ replace_na(.x, 0))) |>
  arrow::to_duckdb()
sums_tbl <- 
  ddb_tbl |> 
  pivot_longer(cols = all_of(Var_list)) |> 
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date)) 
sums_tbl

Which seems to resolve the issue.

Interesting, I'm not able to reproduce this even if I fill one Var column and one row with NAs :

df$Var_00 <- NA
df[9,3:12] <- NA

Anyway, if NA / NULL replacement is needed, I'd rather let DuckDB handle this, after pivoting:

sums_tbl <- 
  ddb_tbl |> 
  pivot_longer(cols = starts_with("Var_")) |> 
  mutate(value = coalesce(value, 0L)) |> 
  summarise(sum = sum(value, na.rm = TRUE), .by = c(ID, date))
