I'm trying to compute the correlation matrix of the columns of a duckdb table loaded as a lazy tbl. According to the GitHub repo, that function supports the dbplyr backend, but it seems to be loading the dataset entirely into memory. Even though that isn't a problem for my use case, duckdb uses parallel execution by default, whereas computing it in R forces a single thread.
Is this a consequence of using duckdb? Does it only support Spark databases as a backend?
Hi @davidelahoz, welcome to the community. There is no step in corrr that loads all of the data into R; it only pulls the results from the database. Do you have any code I can use to try to recreate this?
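Something self-contained along these lines would be ideal, for example an in-memory duckdb with a small toy table. This is only a sketch: the table and column names are placeholders, and I'm assuming correlate() dispatches to the dbplyr backend as described in the repo.

library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)
library(corrr)

# In-memory duckdb with a small toy table
con <- dbConnect(duckdb())
mtcars_db <- copy_to(con, mtcars, "mtcars_db")

# correlate() on the lazy tbl; nothing should be pulled into R
# except the resulting correlation matrix
mtcars_db |>
  select(mpg, disp, hp, wt) |>
  correlate()

dbDisconnect(con, shutdown = TRUE)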
Yes, the code is pretty basic; I'm just using correlate() from corrr:
library(DBI)
library(duckdb)
library(dplyr)
library(corrr)

# Create / connect to database file
drv <- duckdb(dbdir = "./features.duckdb")
con <- dbConnect(drv)

# Load db table as a lazy tbl (not an in-memory tibble)
features_db <- tbl(con, "Features")

# Prepare data for analysis
data <- features_db |>
  slice_sample(n = 100) |>
  mutate(target = as.logical(Category)) |>
  select(!Category)

# Get correlations
data |>
  select(-target) |>
  correlate()
The table that it loads from the database consists of a column called Category, which contains either 0 (false) or 1 (true), and 5000 columns containing numerical data. After a long execution, it seems that the dataset is initially loaded entirely into memory for some R operations and only then is duckdb called. It's probably a consequence of something I'm doing wrong.
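In case it helps with reproducing, here is a rough sketch of how a table with the same shape could be generated (the table and column names match my snippet above; I've used 50 numeric columns instead of 5000 to keep it small, and the values are just random noise):

library(DBI)
library(duckdb)

# Synthetic stand-in for my table: Category is 0/1, the rest are numeric
n_rows <- 1000
n_cols <- 50  # my real table has ~5000 numeric columns
fake <- as.data.frame(matrix(rnorm(n_rows * n_cols), ncol = n_cols))
fake$Category <- sample(0:1, n_rows, replace = TRUE)

con <- dbConnect(duckdb(dbdir = "./features.duckdb"))
dbWriteTable(con, "Features", fake, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)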
Could it be that you didn't load dbplyr?
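With dbplyr loaded, one way to check whether the pipeline stays lazy before correlate() is called is something like the following (reusing the data object from your snippet; this is only a diagnostic sketch):

library(dbplyr)

# If this prints SQL, the pipeline up to this point is still lazy
# and will be executed inside duckdb rather than in R
data |>
  select(-target) |>
  show_query()

# A lazy table should report dbplyr/dbi classes, not "data.frame"
class(data)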