I'm trying to compute the correlation matrix of the columns of a duckdb table loaded as a lazy tbl. According to the GitHub repo, that function supports the dbplyr backend, but it seems to be loading the dataset entirely into memory. Even though that isn't a problem for my use case, duckdb uses parallel execution by default, whereas computing it in R forces a single thread.
Is this a consequence of using duckdb? Does it only support Spark databases as a backend?
Hi @davidelahoz, welcome to the community. There is no step in corrr that loads all of the data into R; it only pulls the results from the database. Do you have any code I can use to try to recreate this?
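Something self-contained along these lines would be ideal, for example an in-memory duckdb with a small toy table. This is only a sketch: the table and column names are placeholders, and I'm assuming correlate() dispatches to the dbplyr backend as described in the repo.

library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)
library(corrr)

# In-memory duckdb with a small toy table
con <- dbConnect(duckdb())
mtcars_db <- copy_to(con, mtcars, "mtcars_db")

# correlate() on the lazy tbl; nothing should be pulled into R
# except the resulting correlation matrix
mtcars_db |>
  select(mpg, disp, hp, wt) |>
  correlate()

dbDisconnect(con, shutdown = TRUE)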
Yes, the code is pretty basic; I'm just using correlate() from corrr:
library(DBI)
library(duckdb)
library(dplyr)
library(corrr)

# Create / connect to database file
drv <- duckdb(dbdir = "./features.duckdb")
con <- dbConnect(drv)

# Load db table as a lazy tbl (not an in-memory tibble)
features_db <- tbl(con, "Features")

# Prepare data for analysis
data <- features_db |>
  slice_sample(n = 100) |>
  mutate(target = as.logical(Category)) |>
  select(!Category)

# Get correlations
data |>
  select(-target) |>
  correlate()
The table that it loads from the database consists of a column called Category, which contains either 0 (false) or 1 (true), and 5000 columns containing numerical data. After a long execution, it seems that the dataset is initially loaded entirely into memory for some R operations and only then is duckdb called. It's probably a consequence of something I'm doing wrong.
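In case it helps with reproducing, here is a rough sketch of how a table with the same shape could be generated (the table and column names match my snippet above; I've used 50 numeric columns instead of 5000 to keep it small, and the values are just random noise):

library(DBI)
library(duckdb)

# Synthetic stand-in for my table: Category is 0/1, the rest are numeric
n_rows <- 1000
n_cols <- 50  # my real table has ~5000 numeric columns
fake <- as.data.frame(matrix(rnorm(n_rows * n_cols), ncol = n_cols))
fake$Category <- sample(0:1, n_rows, replace = TRUE)

con <- dbConnect(duckdb(dbdir = "./features.duckdb"))
dbWriteTable(con, "Features", fake, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)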
Could it be that you didn't load dbplyr?
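With dbplyr loaded, one way to check whether the pipeline stays lazy before correlate() is called is something like the following (reusing the data object from your snippet; this is only a diagnostic sketch):

library(dbplyr)

# If this prints SQL, the pipeline up to this point is still lazy
# and will be executed inside duckdb rather than in R
data |>
  select(-target) |>
  show_query()

# A lazy table should report dbplyr/dbi classes, not "data.frame"
class(data)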