Cross-posted from Stack Overflow: "Deterministic random number generation in duckdb with dplyr syntax"
How can I use duckdb's setseed() function (see reference doc) with dplyr syntax to make sure the analysis below is reproducible?
# dplyr version 1.1.1
# arrow version 11.0.0.3
# duckdb 0.7.1.1

out_dir <- tempfile()
arrow::write_dataset(mtcars, out_dir, partitioning = "cyl")
mtcars_ds <- arrow::open_dataset(out_dir)

mtcars_smry <- mtcars_ds |>
  arrow::to_duckdb() |>
  dplyr::mutate(
    fold = ceiling(3 * random())
  ) |>
  dplyr::summarize(
    avg_hp = mean(hp),
    .by = c(cyl, fold)
  )

mtcars_smry |>
  dplyr::collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 9 × 3
#>     cyl  fold avg_hp
#>   <int> <dbl>  <dbl>
#> 1     4     1   92
#> 2     4     3   82.3
#> 3     4     2   74.5
#> 4     8     2  183.
#> 5     8     3  210
#> 6     8     1  300.
#> 7     6     3  110
#> 8     6     1  117
#> 9     6     2  175
Created on 2023-08-27 with reprex v2.0.2
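
For concreteness, here is a rough sketch of the kind of thing I have in mind. It is not a confirmed solution. It assumes that calling setseed() through DBI on an explicitly created connection, and passing that same connection to arrow::to_duckdb() via its con argument, affects the later random() calls. The helper name run_seeded and the seed value 0.42 are made up for illustration. I have not verified that multi-threaded execution assigns random values to rows in a stable order, so the sketch compares two runs instead of assuming they match.

```r
# Sketch only: assumes setseed() affects later random() calls
# issued on the same DuckDB connection.
con <- DBI::dbConnect(duckdb::duckdb())

out_dir <- tempfile()
arrow::write_dataset(mtcars, out_dir, partitioning = "cyl")
mtcars_ds <- arrow::open_dataset(out_dir)

# Hypothetical helper: seed the RNG, then run the query from the question.
run_seeded <- function(seed) {
  # DuckDB documents the seed as a value between -1 and 1.
  DBI::dbGetQuery(con, sprintf("SELECT setseed(%f)", seed))

  mtcars_ds |>
    arrow::to_duckdb(con = con) |>
    dplyr::mutate(fold = ceiling(3 * random())) |>
    dplyr::summarize(
      avg_hp = mean(hp, na.rm = TRUE),
      .by = c(cyl, fold)
    ) |>
    dplyr::arrange(cyl, fold) |>
    dplyr::collect()
}

first <- run_seeded(0.42)
second <- run_seeded(0.42)

# Check, rather than assume, that the two seeded runs agree.
all.equal(first, second)

DBI::dbDisconnect(con, shutdown = TRUE)
```

If all.equal() does not return TRUE here, the seed alone is not enough and something else (for example row order or parallelism) is also influencing the fold assignment.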