Large model sizes (including training data)

Hi,
I really like tidymodels :slight_smile: But recently, my models have grown in size as a function of the training data, which I would like to avoid. To avoid this, my current code fits the model in a separate environment.

In glm() you can use y = FALSE and model = FALSE to ensure the data are not saved with the model.

glm(y ~ x, data = d, family = binomial(), model = FALSE, y = FALSE)

However, if I understand it correctly, these parameters cannot be passed to glm() when using tidymodels, right?

To reduce the model size, I have also tried removing parts of the object before saving, but that either seems to pull the data back in from the environment when saving or breaks the predictive ability of the models.
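
Roughly along these lines (only an illustrative sketch for an lm/glm engine fit; model_spec is a placeholder, and exactly which components are safe to drop is part of what I'm unsure about):

fit_obj <- parsnip::fit(model_spec, y ~ x, data = xy_all)   # model_spec is a placeholder spec
fit_obj$fit$model <- NULL                                    # drop the stored model frame
attr(fit_obj$fit$terms, ".Environment") <- baseenv()         # detach the training-data environment
# ...but saving can still capture data through other environments, and
# stripping too much breaks predict()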

So, how can I ensure that models created using tidymodels do not save any training data and stay the same size irrespective of the training data size?

I use the code below; it worked some time ago but now saves inflated models.

I have tried adding parsnip::set_engine("lm", y = FALSE, model = FALSE) and parsnip::fit(wf_final, data = xy_all, y = FALSE, model = FALSE), with no luck.

  model_save_small_size <- function(xy_all, final_recipe, penalty, mixture, model, nr_predictors) {
    env_final_model <- new.env(parent = globalenv())
    env_final_model$xy_all <- xy_all
    env_final_model$final_recipe <- final_recipe
    env_final_model$penalty_mode <- statisticalMode(penalty)
    env_final_model$mixture_mode <- statisticalMode(mixture)
    env_final_model$model <- model
    env_final_model$nr_predictors <- nr_predictors
    env_final_model$statisticalMode <- statisticalMode
    env_final_model$`%>%` <- `%>%`

    final_predictive_model <- with(env_final_model, {
      if (nr_predictors > 3) {
        final_predictive_model_spec <-
          if (model == "regression") {
            parsnip::linear_reg(penalty = penalty_mode, mixture = mixture_mode)
          } else if (model == "logistic") {
            parsnip::logistic_reg(mode = "classification", penalty = penalty_mode, mixture = mixture_mode)
          } else if (model == "multinomial") {
            parsnip::multinom_reg(mode = "classification", penalty = penalty_mode, mixture = mixture_mode)
          }

        final_predictive_model_spec <- final_predictive_model_spec %>%
          parsnip::set_engine("glmnet")

        # Create Workflow (to know variable roles from recipes) help(workflow)
        wf_final <- workflows::workflow() %>%
          workflows::add_model(final_predictive_model_spec) %>%
          workflows::add_recipe(final_recipe[[1]])

        parsnip::fit(wf_final, data = xy_all)
      } else if (nr_predictors == 3) {
        final_predictive_model_spec <-
          if (model == "regression") {
            parsnip::linear_reg(mode = "regression") %>%
              parsnip::set_engine("lm")
          } else if (model == "logistic") {
            parsnip::logistic_reg(mode = "classification") %>%
              parsnip::set_engine("glm")
          } else if (model == "multinomial") {
            parsnip::multinom_reg(mode = "classification") %>%
              parsnip::set_engine("glmnet")
          }

        wf_final <- workflows::workflow() %>%
          workflows::add_model(final_predictive_model_spec) %>%
          workflows::add_recipe(final_recipe[[1]])

        parsnip::fit(wf_final, data = xy_all)
      }
    })
    remove("final_recipe", envir = env_final_model)
    remove("xy_all", envir = env_final_model)
    return(final_predictive_model)
  }

Any help is much appreciated.

Yes, you can pass things like x = FALSE to glm() via set_engine(). There's an example below.

The good news is that the butcher package is designed to remove everything that is not required for prediction. As you'll see below, that helps, but the model object is still pretty large.
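
As far as I know, butcher also has methods for fitted workflows, so you could apply it directly to the object your function returns; a minimal sketch (reusing the names from your code above):

wf_fit <- parsnip::fit(wf_final, data = xy_all)
wf_small <- butcher::butcher(wf_fit)    # drop pieces not needed for prediction
predict(wf_small, new_data = xy_all)    # predictions should still work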

The bad news is that this model's QR decomposition grows with the number of training rows and becomes really large, so there is no way to significantly reduce its size. For this data set, the factors generate a lot of dummy variable columns, so there's no way to have a small QR object (it's much worse if you include the ZIP code in the model).

There's more good news though... the orbital package can translate this model to SQL, and you can use that (in R or in a database). It can't be used with every model type (KNN, for example); see this page. One other thing... if your model ends up using fewer predictors than the original set (as with a tree or glmnet), the unused predictors aren't needed for prediction. It also works if you have a recipe (for supported steps). The SQL is optimized to be very small; there's a short SQL sketch tacked onto the end of the reprex below.

It's a 165,059-fold reduction in size!

Here's a reprex for the whole thing:

library(tidymodels)
library(butcher)
library(lobstr)
library(orbital)
library(tidypredict)

bigger_houses <- Sacramento[rep(1:nrow(Sacramento), 10^3),]
obj_size(bigger_houses)
#> 44.75 MB
# Baseline
glm_fit_1 <- 
  linear_reg() %>% 
  fit(price ~ . -  zip, data = bigger_houses)

obj_size(glm_fit_1)
#> 447.41 MB
# What's taking up memory? 
weigh(glm_fit_1)
#> # A tibble: 29 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 qr.qr           328.  
#>  2 terms            44.8 
#>  3 call             44.8 
#>  4 effects          14.9 
#>  5 residuals         7.46
#>  6 fitted.values     7.46
#>  7 model.baths       7.46
#>  8 model.latitude    7.46
#>  9 model.longitude   7.46
#> 10 model.zip         3.73
#> # ℹ 19 more rows
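# Why qr.qr dominates: it is the QR factorization of the model matrix, with one
# row per training observation, so it grows linearly with the training data
# (dim(extract_fit_engine(glm_fit_1)$qr$qr) shows this).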
# Removing x, y, and the model frame via engine arguments
glm_fit_2 <- 
  linear_reg() %>% 
  set_engine("glm", x = FALSE, y = FALSE, model = FALSE) %>% 
  fit(price ~ . -  zip, data = bigger_houses)

# I have _no_ idea why this is only slightly smaller
obj_size(glm_fit_2)
#> 428.82 MB
weigh(glm_fit_2)
#> # A tibble: 56 × 2
#>    object              size
#>    <chr>              <dbl>
#>  1 qr.qr             332.  
#>  2 terms              44.8 
#>  3 call               44.8 
#>  4 formula            44.8 
#>  5 effects            14.9 
#>  6 residuals           7.46
#>  7 fitted.values       7.46
#>  8 linear.predictors   7.46
#>  9 weights             7.46
#> 10 prior.weights       7.46
#> # ℹ 46 more rows
# Remove everything not required for prediction
glm_fit_3 <- butcher(glm_fit_1)

obj_size(glm_fit_3)
#> 395.22 MB
weigh(glm_fit_3)
#> # A tibble: 29 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 qr.qr           328.  
#>  2 effects          14.9 
#>  3 residuals         7.46
#>  4 model.baths       7.46
#>  5 model.latitude    7.46
#>  6 model.longitude   7.46
#>  7 model.zip         3.73
#>  8 model.city        3.73
#>  9 model.type        3.73
#> 10 model.price       3.73
#> # ℹ 19 more rows
# Convert to an orbital object (which can be translated to SQL)

glm_fit_4 <- orbital(glm_fit_1)
obj_size(glm_fit_4)
#> 3.07 kB

# Just checking
predict(glm_fit_4, head(bigger_houses))
#> # A tibble: 6 × 1
#>     .pred
#>     <dbl>
#> 1 141841.
#> 2 156530.
#> 3 136472.
#> 4 143465.
#> 5 131498.
#> 6 109337.

# Reduction in size:
as.numeric(obj_size(glm_fit_1) / obj_size(glm_fit_4))
#> [1] 165059
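
# To get the SQL itself, orbital can emit a query for a database connection.
# (A sketch; I'm assuming orbital_sql() plus a DBI/duckdb connection here --
# check the orbital documentation for the exact interface.)
con <- DBI::dbConnect(duckdb::duckdb())
orbital_sql(glm_fit_4, con)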

Created on 2025-04-09 with reprex v2.1.1

Thanks a million for this :slight_smile:
(Also, at the moment my biggest concern is not the size per se; it's a data security/safety concern about not wanting training data to sit somewhere in the object when sharing models openly.)

That's a great point that I had not thought about :+1:
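
If the worry is training rows hiding inside a shared object, one sanity check before publishing is to weigh the components of the underlying engine fit and see what it still stores; a rough sketch (final_model stands in for your fitted workflow):

eng <- workflows::extract_fit_engine(final_model)  # the underlying lm/glm/glmnet fit
butcher::weigh(eng)                                # components by size; look for model.* entries
names(eng)                                         # e.g. confirm $model / $data are gone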
