Apologies for the lack of a reproducible example: I'm using proprietary data, working on a Linux cluster, and the issues only show up after the code has been running for hours.
I have a dataset that is only 206k rows long and about 170 columns wide when factors are one-hot encoded (although I stopped doing this for tree-based methods, as I've seen suggested elsewhere for tidymodels).
I'm on a Linux cluster with 128 GB of memory available. I've simplified the job to 3 algorithms (elastic net, random forest, XGBoost) tuned with 5-fold cross-validation over a grid of size 30, so that's 3 x 5 x 30 = 450 model fits.
I define one function to build the workflow set and another to run cross-validation; a script then calls them and pulls out only the information I'm interested in: the hyperparameter values and their metrics.
library(tidymodels)  # parsnip, recipes, workflowsets, tune, yardstick, rsample

do_wfs <- function(dat_train) {
  # candidate models -----------------------------------------------------------
  spec_elnet <- logistic_reg(
    engine = "glmnet",
    penalty = tune(),
    mixture = tune()
  )
  spec_rf <- rand_forest(
    mode = "classification",
    engine = "ranger",
    trees = 500,
    mtry = tune(),
    min_n = tune()
  )
  spec_xgb <- boost_tree(
    mode = "classification",
    engine = "xgboost",
    tree_depth = tune(),
    trees = 15,
    learn_rate = tune(),
    mtry = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = 1.0,
    stop_iter = Inf
  )

  # candidate recipes ----------------------------------------------------------
  rec_elnet <- recipe(responded ~ ., dat_train) %>%
    update_role(voterbase_id, new_role = "id_variable") %>%
    step_date(call_date, features = c("dow", "month")) %>%
    step_select(-call_date) %>%
    step_dummy(all_nominal(), -responded, -voterbase_id, one_hot = TRUE) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_predictors())
  rec_trees <- recipe(responded ~ ., dat_train) %>%
    update_role(voterbase_id, new_role = "id_variable") %>%
    step_date(call_date, features = c("dow", "month")) %>%
    step_select(-call_date)

  # make workflowset -----------------------------------------------------------
  # cross = FALSE pairs each recipe with the model in the same position, giving
  # the 3 intended workflows rather than the 3 x 3 = 9 that the default
  # cross = TRUE would generate.
  wfs <- workflow_set(
    preproc = list(rec_trees, rec_trees, rec_elnet),
    models = list(
      rand_forest = spec_rf,
      xgboost = spec_xgb,
      elastic_net = spec_elnet
    ),
    cross = FALSE
  )

  # out ------------------------------------------------------------------------
  return(wfs)
}
do_cv <- function(dat_train, folds, grid_size) {
  wfs <- do_wfs(dat_train)
  resamples <- vfold_cv(dat_train, v = folds)
  out <- wfs %>%
    workflow_map(
      "tune_grid",
      grid = grid_size,
      resamples = resamples,
      metrics = metric_set(
        accuracy,
        bal_accuracy,
        f_meas,
        roc_auc,
        sensitivity,
        specificity,
        precision
      ),
      control = control_grid(verbose = TRUE)
    )
  return(out)
}
Then, after reading the data in, the script is just:
res <- do_cv(dat, 5, 30) %>%
  rowwise() %>%
  mutate(metrics = list(collect_metrics(result))) %>%
  select(wflow_id, metrics) %>%
  unnest(metrics)
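To be explicit about what I'm keeping: as I understand collect_metrics(), each row of res should be one workflow x hyperparameter combination x metric, with the mean and standard error across the 5 folds. That small summary tibble is all the follow-up scripts need (the file name below is just a placeholder):

# `res` should have columns like wflow_id, the hyperparameters (penalty,
# mixture, mtry, min_n, ...), .metric, .estimator, mean, n, std_err, .config.
# This small summary is all I want to persist for the later analysis script:
readr::write_rds(res, "cv_metric_summaries.rds")  # placeholder path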
The elastic net ran first and got through all 30 candidate models on each of the 5 folds. It then made it about 3 or 4 models into the first fold of the random forest before I ran out of memory.
Is there something I can optimize here?
The workflow set mapping saves each fitted model, correct? I don't want that: I'm going to toss all of these models anyway, keeping only the hyperparameter values and metrics. I'll analyze the cross-validation results in a new script and fit the final model in another new script informed by those analyses. My theory is that I run out of memory because the workflow set mapping is holding on to every single model (450 of them), which fans out and blows up memory. Is this true? If so, is there a way to extract (a) the hyperparameter values and (b) the mean and SE of each metric from a model before tossing it, and/or a way to avoid saving the models at all?
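Here is roughly the kind of thing I've been considering instead of workflow_map(): tune each workflow separately with tune_grid(), call collect_metrics() on it, and drop the tuning object before moving on. This is only a sketch of what I mean; the function name is made up and I haven't verified that the rm()/gc() step actually lowers peak memory:

# Sketch: tune one workflow at a time and keep only the per-candidate metric
# summaries, discarding the heavy tuning objects as I go.
tune_one_at_a_time <- function(dat_train, n_folds, grid_size) {
  wfs <- do_wfs(dat_train)
  resamples <- vfold_cv(dat_train, v = n_folds)
  purrr::map_dfr(wfs$wflow_id, function(id) {
    wf <- extract_workflow(wfs, id = id)
    res <- tune_grid(
      wf,
      resamples = resamples,
      grid = grid_size,
      metrics = metric_set(roc_auc, accuracy),   # shortened for the sketch
      control = control_grid(save_pred = FALSE, save_workflow = FALSE)
    )
    summaries <- collect_metrics(res)            # hyperparameters + mean/std_err
    rm(res, wf)                                  # drop the heavy objects
    gc()
    dplyr::mutate(summaries, wflow_id = id, .before = 1)
  })
}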
In general, how can I make this more computationally efficient? I find the workflow set approach more elegant than what I'd write with scikit-learn, but I'm running into significant memory issues here.