Hey - struggling with this one and couldn't find any obvious Stack Overflow examples answering the question. I'm sure I'm missing an obvious argument step here, but any help would be appreciated!
I have ~60GB of CSV weather data scattered across 366 folders, each containing ~100 files of about 1-2MB, so roughly 36,600 files in total. Way too much to keep in memory! So I thought: let's iterate over all these files, load them into a local DB like duckdb or RSQLite, and query results from there. If the results aren't in memory, we should be OK, right?
So I did something like this:
library(readr)
library(dplyr)
library(DBI)
library(duckdb)
library(dbplyr)

# Instantiate a local, on-disk database
con <- dbConnect(duckdb::duckdb(), dbname = "Output-Files/WX-DB.duckdb")

# Read one file as a sample to capture the column spec and names
wx_file <- read_csv("Input-Data/directory/20220101/11052.csv")
wx_spec <- spec(wx_file)
wx_colnames <- colnames(wx_file)

# Copy the sample in to create the table, then delete its rows
copy_to(con, wx_file)
wx_file_remote <- tbl(con, "wx_file")
rows_delete(x = wx_file_remote, y = wx_file, copy = TRUE, unmatched = "ignore")
The theory here: instantiate a local database, read in one of the files as a sample to get the spec and column names, copy it over to create the table with the right schema, and then delete all the rows so the sample file doesn't need to be skipped in the full load.
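(As an aside: as far as I can tell, dplyr::copy_to() defaults to temporary = TRUE on database backends, so in case the copy-then-delete dance is part of the problem, I'd expect this plain-DBI sketch to create a persistent empty table directly - untested on my end:)

# Hypothetical alternative: create an empty, non-temporary table whose
# columns are inferred from the sample data frame; no rows are inserted
dbCreateTable(con, "wx_file", wx_file)
wx_file_remote <- tbl(con, "wx_file")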
Unfortunately, something weird happens with the code above - the DB file doesn't grow! If I call:
dbDisconnect(conn = con)
con <- dbConnect(duckdb::duckdb(), dbname = "Output-Files/WX-DB.duckdb")
wx_file_remote <- tbl(con, "wx_file")
I get:
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT *
FROM wx_file AS q02
WHERE (0 = 1)
Error: Catalog Error: Table with name wx_file does not exist!
Did you mean "pg_type"?
LINE 2: FROM wx_file AS q02
Where'd my table go?! I get similar results with RSQLite:
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! no such table: wx_file
Run `rlang::last_error()` to see where the error occurred.
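For what it's worth, here's the sanity check I'd run on the fresh connection (dbListTables() is plain DBI, so it should behave the same for both backends):

dbListTables(con)
# given the catalog errors above, I'd expect this to come back
# empty rather than listing "wx_file"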
As far as I can tell, even though I (think I) am asking for the DB not to be created in-memory, it's doing it anyway, and I can't find any consistent way of making it not do that. That seems confirmed by doing the following:
library(purrr)
library(dplyr)
library(stringr)
folders <- dir(path = "Input-Data/directory/")

purrr::walk(folders, ~ {
  # Build the full path for every file in this folder
  files <- dir(path = str_c("Input-Data/directory/", .x, "/")) %>%
    str_c("Input-Data/directory/", .x, "/", .)
  # Read the whole folder lazily, reusing the sample's spec
  tmp <- read_csv(files, col_types = wx_spec, col_names = wx_colnames,
                  num_threads = 1, lazy = TRUE)
  # Append into the remote table, writing in place
  rows_append(wx_file_remote, tmp, in_place = TRUE, copy = TRUE)
  rm(tmp)
  gc()
}, .progress = TRUE)
Which quickly annihilates what little local RAM I have.
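In case it helps with diagnosis, I believe DuckDB supports SQLite's PRAGMA database_list, which should report where the catalog actually lives:

dbGetQuery(con, "PRAGMA database_list")
# I'd expect to see Output-Files/WX-DB.duckdb in the file column;
# an in-memory entry there would explain everything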
What am I missing here? This is a real "feel like I'm taking crazy pills" situation; I'm sure it's something obvious.