I have prepared following script. It creates a sample dataset where you can adjust n_observations
and n_names
to create a dataset that is bigger than your ram which is then written to disk. At the end, it tries to write data into a table which is obviously not possible because it is not loaded into RAM. collect
ing data
is obviously not possible.
What's the best method to chunk the dataset?
A naive method would be to loop over the groups. What if the groups aren't as obvious as in this case?
library("tidyverse")
library("glue")
library("stringi")
library("arrow")
library("duckdb")
#'
#' # Create Sample Dataset
#'
sample_data <- function(n) {
data <- tibble(
datetime = ymd("1970-01-01") + milliseconds(c(0:(n - 1))),
name = stri_rand_strings(1, 12),
value = rnorm(n) |> cumsum()
)
return(data)
}
set.seed(1)
n_observations <- 1e3
n_names <- 1e3
data_path <- glue(tempdir(), "/data")
dir.create(data_path)
for (i in 1:n_names) {
data <- sample_data(n_observations)
data |>
group_by(name) |>
write_dataset(data_path)
}
directory_is_empty <- data_path |>
list.files() |>
length() |>
is_empty()
directory_is_empty
#'
#' # Load Data into Database
#'
data <- open_dataset(data_path)
# > data |> to_duckdb()
# # Source: table<arrow_003> [?? x 3]
# datetime value name
# <dttm> <dbl> <chr>
# 1 1970-01-01 00:00:00.000 -1.76 1EFxw3QgJNX6
# 2 1970-01-01 00:00:00.001 -1.79 1EFxw3QgJNX6
# 3 1970-01-01 00:00:00.002 -0.978 1EFxw3QgJNX6
# 4 1970-01-01 00:00:00.003 -0.532 1EFxw3QgJNX6
# 5 1970-01-01 00:00:00.004 0.364 1EFxw3QgJNX6
# 6 1970-01-01 00:00:00.005 2.21 1EFxw3QgJNX6
# 7 1970-01-01 00:00:00.006 4.39 1EFxw3QgJNX6
# 8 1970-01-01 00:00:00.007 3.02 1EFxw3QgJNX6
# 9 1970-01-01 00:00:00.008 5.10 1EFxw3QgJNX6
# 10 1970-01-01 00:00:00.009 6.64 1EFxw3QgJNX6
# # ℹ more rows
# # ℹ Use `print(n = ...)` to see more rows
data |>
group_by(name) |>
count()
con <- dbConnect(duckdb(), dbdir = "test.duckdb")
con |>
dbWriteTable(
"data",
data,
append = FALSE,
overwrite = TRUE
)
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "FileSystemDataset"'