I'm trying to use tidymodels to fit a logistic regression on a dataset with ~2.5 million rows and fewer than 10 predictors. I'm running into memory issues, and I don't understand what's going on. Each record has an ID, which is a string. If I remove the string ID from the dataset before creating the model, it works fine (df_1 in the example, which has an integer id instead). If I include it (df_2), I get a massive memory allocation error, even though I'm only using it as an ID.
Also, if I build the formula manually (outcome ~ pred_fct_1 + pred_fct_2 + ...), it works regardless of whether the id is a number or a string.
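For reference, a manual formula like the one above can be built without typing every term. This is only a minimal sketch: it assumes the predictor names from the example data in this post and uses base R's reformulate(); the names predictors and manual_formula are just illustrative.

```r
# Predictor names as used in the example data; the id column is deliberately left out
predictors <- c(paste0("pred_fct_", 1:6), paste0("pred_dbl_", 1:3))

# reformulate() (stats package, attached by default) builds `response ~ term1 + term2 + ...`
manual_formula <- reformulate(predictors, response = "outcome")
manual_formula
```

The resulting formula names only the outcome and the nine predictors, so the id never appears in it.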
library(tidymodels)

nr_records <- 2.5e6

# df_1: integer id, binary factor outcome, six factor predictors, three numeric predictors
df_1 <- tibble(
  id = 1:nr_records,
  outcome = factor(if_else(runif(nr_records) > 0.9, "Y", "N")),
  pred_fct_1 = factor(if_else(runif(nr_records) > 0.9, "Y", "N")),
  pred_fct_2 = factor(if_else(runif(nr_records) > 0.7, "Y", "N")),
  pred_fct_3 = factor(if_else(runif(nr_records) > 0.9, "Y", "N")),
  pred_fct_4 = factor(if_else(runif(nr_records) > 0.7, "Y", "N")),
  pred_fct_5 = factor(if_else(runif(nr_records) > 0.9, "Y", "N")),
  pred_fct_6 = factor(if_else(runif(nr_records) > 0.7, "Y", "N")),
  pred_dbl_1 = runif(nr_records),
  pred_dbl_2 = runif(nr_records),
  pred_dbl_3 = runif(nr_records)
)

# df_2: same data, but with a random 8-character string id
df_2 <- df_1 %>%
  mutate(id = stringi::stri_rand_strings(nr_records, 8, pattern = "[A-Z0-9]"))

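fit_data <- function(df) {
  # Give id the "ID" role, then dummy-encode all nominal columns except the outcome
  my_recipe <- recipe(outcome ~ ., data = df) %>%
    update_role(id, new_role = "ID") %>%
    step_dummy(all_nominal(), -all_outcomes())

  # Engine assumed to be "glm", the parsnip default for logistic_reg()
  lr_mod <- logistic_reg() %>%
    set_engine("glm")

  wflow <- workflow() %>%
    add_model(lr_mod) %>%
    add_recipe(my_recipe)

  wflow %>%
    fit(data = df)
}
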
fit_1 <- fit_data(df_1)
fit_2 <- fit_data(df_2)
#> Error: cannot allocate vector of size 46566.1 Gb
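
One thing that may help narrow this down: the size in the error message is almost exactly what a dense nr_records x nr_records matrix of doubles would need. That might suggest the string ids are being expanded into one column per unique value somewhere, though that is only a guess based on the arithmetic below (the name size_gib is just illustrative).

```r
nr_records <- 2.5e6
bytes_per_double <- 8

# Memory for a dense nr_records x nr_records matrix of doubles,
# in units of 1024^3 bytes (which R's allocation errors label "Gb")
size_gib <- nr_records^2 * bytes_per_double / 1024^3
size_gib
#> [1] 46566.13
```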
