Deterministic random number generation in duckdb with dplyr syntax

Cross-posted from Stack Overflow: "Deterministic random number generation in duckdb with dplyr syntax"

How can I use duckdb's setseed() function (see reference doc) with dplyr syntax to make sure the analysis below is reproducible?

# dplyr version 1.1.1
# arrow version 11.0.0.3
# duckdb 0.7.1.1
out_dir <- tempfile()
arrow::write_dataset(mtcars, out_dir, partitioning = "cyl")

mtcars_ds <- arrow::open_dataset(out_dir)

mtcars_smry <- mtcars_ds |>
  arrow::to_duckdb() |>
  dplyr::mutate(
    fold = ceiling(3 * random())
  ) |>
  dplyr::summarize(
    avg_hp = mean(hp),
    .by = c(cyl, fold)
  )

mtcars_smry |>
  dplyr::collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 9 × 3
#>     cyl  fold avg_hp
#>   <int> <dbl>  <dbl>
#> 1     4     1   92  
#> 2     4     3   82.3
#> 3     4     2   74.5
#> 4     8     2  183. 
#> 5     8     3  210  
#> 6     8     1  300. 
#> 7     6     3  110  
#> 8     6     1  117  
#> 9     6     2  175

Created on 2023-08-27 with reprex v2.0.2

Is there a reason you have not accepted r2evans's solution, posted yesterday?
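For readers who land here without the Stack Overflow answer to hand, here is a minimal sketch of one way to seed DuckDB's random() from R. It is not necessarily the approach r2evans posted. It assumes that DBI and duckdb are installed, that setseed() accepts a value between 0 and 1, and that the seed applies to the connection on which it is set. The helper name `draw` and the seed value 0.42 are made up for illustration.

```r
library(DBI)

# Open an in-memory DuckDB connection that we control, so the seed and the
# later queries are guaranteed to run on the same connection.
con <- dbConnect(duckdb::duckdb())

# Hypothetical helper: re-seed, then draw five random numbers.
draw <- function(con, seed = 0.42) {
  # Assumption: setseed() takes a value between 0 and 1.
  dbExecute(con, sprintf("SELECT setseed(%f)", seed))
  dbGetQuery(con, "SELECT random() AS r FROM range(5)")$r
}

first  <- draw(con)
second <- draw(con)

# If the seed took effect, both draws should match.
print(identical(first, second))

dbDisconnect(con, shutdown = TRUE)
```

To apply the same idea to the pipeline in the question, the seeded connection would presumably be passed along with `arrow::to_duckdb(con = con)` before the `dplyr::mutate()` step, so that random() runs on the connection that was seeded. Even then, which row receives which random value may depend on row order and on parallel execution, so it is worth checking reproducibility on the real dataset rather than assuming it.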
