Memory issues when running workflowsets

Apologies for the lack of a reproducible example: I'm using proprietary data, working on a Linux cluster, and the issues I'm running into only show up after the code has been running for hours.

I have a dataset that is only 206k rows long and about 170 columns wide once factors are one-hot encoded (although I've stopped one-hot encoding for the tree-based methods, as I've seen suggested elsewhere for tidymodels).

I'm on a Linux cluster with 128 GB of memory available. I've simplified the problem to 3 algorithms (elastic net, random forest, XGBoost), 5-fold cross-validation, and a tuning grid of size 30, so 3 x 5 x 30 = 450 model fits.

I define a function that builds a workflowset and a function that runs cross-validation, and then a script calls them and pulls out only the information I'm interested in: the hyperparameter values and their metrics:

library(tidymodels)    # attaches parsnip, recipes, workflows, tune, yardstick, ...
library(workflowsets)  # workflow_set() and workflow_map()

do_wfs <- function(dat_train) {
  
  # candidate models -----------------------------------------------------------
  spec_elnet <- logistic_reg(
    engine = "glmnet", 
    penalty = tune(), 
    mixture = tune()
  )

  spec_rf <- rand_forest(
    mode = "classification",
    engine = "ranger",
    trees = 500,
    mtry = tune(),
    min_n = tune()
  )
  
  spec_xgb <- boost_tree(
    mode = "classification",
    engine = "xgboost",
    tree_depth = tune(),
    trees = 15,
    learn_rate = tune(),
    mtry = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = 1.0,
    stop_iter = Inf
  )
  
  # candidate recipes ----------------------------------------------------------
  rec_elnet <- recipe(responded ~ ., dat_train) %>%
    update_role(voterbase_id, new_role = "id_variable") %>% 
    step_date(call_date, features = c("dow", "month")) %>% 
    step_select(-call_date) %>% 
    step_dummy(all_nominal(), -responded, -voterbase_id, one_hot = TRUE) %>%
    step_zv(all_predictors()) %>% 
    step_normalize(all_predictors())

  rec_trees <- recipe(responded ~ ., dat_train) %>%
    update_role(voterbase_id, new_role = "id_variable") %>% 
    step_date(call_date, features = c("dow", "month")) %>% 
    step_select(-call_date)
  
  # make workflowset -----------------------------------------------------------
  # cross = FALSE pairs each recipe with its model (3 workflows total),
  # rather than crossing recipes and models into 9 combinations
  wfs <- workflow_set(
    preproc = list(
      trees_rf = rec_trees,
      trees_xgb = rec_trees,
      elnet = rec_elnet
    ),
    models = list(
      rand_forest = spec_rf,
      xgboost = spec_xgb,
      elastic_net = spec_elnet
    ),
    cross = FALSE
  )
  
  # out ------------------------------------------------------------------------
  return(wfs)
  
}

do_cv <- function(dat_train, n_folds, grid_size) {
  
  wfs <- do_wfs(dat_train)
  folds <- vfold_cv(dat_train, v = n_folds)
  out <- wfs %>% 
    workflow_map(
      "tune_grid", 
      grid = grid_size, 
      resamples = folds,
      metrics = metric_set(
        accuracy, 
        bal_accuracy,
        f_meas,
        roc_auc,
        sensitivity, 
        specificity, 
        precision
      ),
      control = control_grid(verbose = TRUE)
    )
  
  return(out)
  
}

Then, after reading the data in, the script is just:

res <- do_cv(dat, 5, 30) %>% 
  rowwise() %>%
  mutate(metrics = list(collect_metrics(result))) %>%
  select(wflow_id, metrics) %>%
  unnest(metrics)

I started by running the elastic net first, and it got through training all 30 models on each of the 5 folds. It then made it about 3 or 4 models into the first fold for the random forest before I ran out of memory.

Is there something I can optimize here?

Does workflow_map() save every fitted model? I don't want that: I'm going to toss all of these models anyway, since I only need the hyperparameter values and metrics. I'll analyze the cross-validation results in one follow-up script and fit the final model in another, informed by that analysis. My theory is that I'm running out of memory because the workflowset mapping keeps every single one of the 450 fitted models around, which blows up memory. Is that true? If so, is there a way to extract (a) the hyperparameter values and (b) the mean and SE of each metric from a fit, and then discard the model rather than saving it?
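For example, is something along these lines what I mean by "not saving the models"? (Just a sketch; I'm not sure these are the right knobs or whether some of them are already the defaults.)

ctrl <- control_grid(
  verbose = TRUE,
  save_pred = FALSE,      # don't keep out-of-sample predictions
  save_workflow = FALSE,  # don't keep the fitted workflow in the results
  extract = NULL          # don't retain anything extracted from each fit
)

# ...and then pass control = ctrl to workflow_map() instead of control_grid(verbose = TRUE)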

In general, how can I make this more computationally efficient? I find the workflowsets interface more elegant than scikit-learn, but I'm running into significant memory issues here.
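One workaround I've considered (untested sketch, and I'm not sure it's the right approach) is skipping workflow_map() and tuning one workflow at a time, keeping only the collect_metrics() output and dropping the full tuning object before moving on:

tune_one_at_a_time <- function(wfs, folds, grid_size, metrics, ctrl) {
  purrr::map_dfr(wfs$wflow_id, function(id) {
    res <- tune_grid(
      extract_workflow(wfs, id),  # pull a single workflow out of the set
      resamples = folds,
      grid = grid_size,
      metrics = metrics,
      control = ctrl
    )
    out <- collect_metrics(res)   # hyperparameter values plus mean/std_err per metric
    out$wflow_id <- id
    rm(res)
    gc()                          # try to release the tuning results before the next workflow
    out
  })
}

I don't know whether this actually frees memory between workflows, or whether there's a way to get the same effect inside workflow_map().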