I'm using workflowsets to compare different models. The data are relatively large (> 500K rows) and the entire procedure is expected to take several days to complete. Unfortunately, the machine I'm running this on is (1) remote and (2) prone to crashing. Since the unstable machine is a given, one strategy for coping with the instability is to break the fitted workflows into individual .rds files. My hope was that once I had one .rds file per completed workflow, I could load all of the rds files and bind them back together after the fact.
Sadly, it turns out that one rds file per workflow is still too large a unit of work, because I can't even get a single such unit to complete. I therefore need a "safety net" that writes smaller rds files, and writes them more frequently, so that if the machine crashes at any moment I can resume without losing the work computed up to the crash.
My question is: how can I program around workflowsets so that it writes out -- as it goes -- as many small .rds files as possible, which I can then reassemble at the end?
In my real-life situation, I have 2 recipes × 5 model specs = 10 workflows, 10-fold cross-validation, and a tuning grid of size 25.
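The general checkpoint-and-resume pattern I have in mind is roughly the sketch below; `work_units` and `fit_unit()` are purely hypothetical placeholders for whatever unit of work turns out to be feasible, not workflowsets API:
## sketch only: checkpoint each unit of work to its own rds, skip units already done
for (unit in work_units) {
  out_file <- paste0("unit_", unit, ".rds")
  if (file.exists(out_file)) next     # finished before a crash: skip it
  saveRDS(fit_unit(unit), out_file)   # do the expensive work, persist immediately
}
## once every unit exists on disk, read the pieces back and combine them
all_pieces <- lapply(list.files(pattern = "^unit_.*\\.rds$"), readRDS)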
To anchor my question in a reproducible example, please consider the following code that I took almost as-is from www.tmwr.org/workflow-sets.html.
library(tidymodels)
library(rules)
library(baguette)
tidymodels_prefer()
data(concrete, package = "modeldata")
concrete <-
  concrete %>%
  group_by(across(-compressive_strength)) %>%
  summarize(compressive_strength = mean(compressive_strength),
            .groups = "drop")
set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)
set.seed(1502)
concrete_folds <-
  vfold_cv(concrete_train, strata = compressive_strength, repeats = 5)
normalized_rec <-
  recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())
poly_recipe <-
  normalized_rec %>%
  step_poly(all_predictors()) %>%
  step_interact(~ all_predictors():all_predictors())
linear_reg_spec <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
nnet_spec <-
  mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
  set_engine("nnet", MaxNWts = 2600) %>%
  set_mode("regression")
mars_spec <-
  mars(prod_degree = tune()) %>% #<- use GCV to choose terms
  set_engine("earth") %>%
  set_mode("regression")
svm_r_spec <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("regression")
svm_p_spec <-
  svm_poly(cost = tune(), degree = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("regression")
knn_spec <-
  nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
cart_spec <-
  decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")
bag_cart_spec <-
  bag_tree() %>%
  set_engine("rpart", times = 50L) %>%
  set_mode("regression")
rf_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")
xgb_spec <-
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
             min_n = tune(), sample_size = tune(), trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
cubist_spec <-
  cubist_rules(committees = tune(), neighbors = tune()) %>%
  set_engine("Cubist")
nnet_param <-
  nnet_spec %>%
  parameters() %>%
  update(hidden_units = hidden_units(c(1, 27)))
normalized <-
  workflow_set(
    preproc = list(normalized = normalized_rec),
    models = list(SVM_radial = svm_r_spec, SVM_poly = svm_p_spec,
                  KNN = knn_spec, neural_network = nnet_spec)
  ) %>%
  option_add(param_info = nnet_param, id = "normalized_neural_network")
model_vars <-
  workflow_variables(outcomes = compressive_strength,
                     predictors = everything())
no_pre_proc <-
  workflow_set(
    preproc = list(simple = model_vars),
    models = list(MARS = mars_spec, CART = cart_spec, CART_bagged = bag_cart_spec,
                  RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec)
  )
with_features <-
  workflow_set(
    preproc = list(full_quad = poly_recipe),
    models = list(linear_reg = linear_reg_spec, KNN = knn_spec)
  )
all_workflows <-
  bind_rows(no_pre_proc, normalized, with_features) %>%
  # Make the workflow ID's a little more simple:
  mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )
grid_results <-
  all_workflows %>%
  workflow_map(
    seed = 1503,
    resamples = concrete_folds,
    grid = 25,
    control = grid_ctrl, verbose = TRUE
  )
If we focus on just the final piece of that code:
- we have the all_workflows object:
> all_workflows
## # A workflow set/tibble: 12 x 4
##    wflow_id             info             option    result
##    <chr>                <list>           <list>    <list>
##  1 MARS                 <tibble [1 x 4]> <opts[0]> <list [0]>
##  2 CART                 <tibble [1 x 4]> <opts[0]> <list [0]>
##  3 CART_bagged          <tibble [1 x 4]> <opts[0]> <list [0]>
##  4 RF                   <tibble [1 x 4]> <opts[0]> <list [0]>
##  5 boosting             <tibble [1 x 4]> <opts[0]> <list [0]>
##  6 Cubist               <tibble [1 x 4]> <opts[0]> <list [0]>
##  7 SVM_radial           <tibble [1 x 4]> <opts[0]> <list [0]>
##  8 SVM_poly             <tibble [1 x 4]> <opts[0]> <list [0]>
##  9 KNN                  <tibble [1 x 4]> <opts[0]> <list [0]>
## 10 neural_network       <tibble [1 x 4]> <opts[1]> <list [0]>
## 11 full_quad_linear_reg <tibble [1 x 4]> <opts[0]> <list [0]>
## 12 full_quad_KNN        <tibble [1 x 4]> <opts[0]> <list [0]>
- all_workflows gives rise to the heavy-lifting procedure that uses workflow_map():
grid_results <-
  all_workflows %>%
  workflow_map(
    seed = 1503,
    resamples = concrete_folds,
    grid = 25,
    control = grid_ctrl, verbose = TRUE
  )
If I run it just like this, it hangs for many hours, then fails, and I get nothing. Alas, even if I do one row at a time and save each result to rds, it still fails and I get nothing:
fitted_wflow_1 <-
  all_workflows[1, ] %>% ## my intent was to manually change this each time: to all_workflows[2, ], etc.
  workflow_map(
    seed = 1503,
    resamples = concrete_folds,
    grid = 25,
    control = grid_ctrl, verbose = TRUE
  )
saveRDS(fitted_wflow_1, "fitted_wflow_1.rds") ## sadly I never get here, because `fitted_wflow_1` never gets created
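For completeness, the loop I had intended to wrap around that manual one-row-at-a-time idea looks roughly like the sketch below (the file names and the skip-if-done check are my own additions). But since even one row never finishes, this granularity is still not enough:
## sketch: one rds per workflow, skipping rows whose file already exists
for (i in seq_len(nrow(all_workflows))) {
  out_file <- paste0("fitted_wflow_", all_workflows$wflow_id[i], ".rds")
  if (file.exists(out_file)) next    # resume point after a crash
  fitted_i <-
    all_workflows[i, ] %>%
    workflow_map(
      seed = 1503,
      resamples = concrete_folds,
      grid = 25,
      control = grid_ctrl, verbose = TRUE
    )
  saveRDS(fitted_i, out_file)
}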
Bottom line: how can I split up the work on all_workflows into pieces finer than one rds per row? How fragmented can I make it, with as many small rds files as possible, while still being able to assemble the results back together at the end?
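To make that concrete, the level of fragmentation I'm imagining is one rds per workflow per resample, along the lines of the sketch below. It assumes tune_grid() can be run on a single-fold rset built with rsample::manual_rset(), and it glosses over at least two things I'm unsure about: keeping the tuning grid identical across folds of the same workflow, and re-attaching options added with option_add() (such as nnet_param). The file naming is mine:
## sketch: one rds per (workflow, resample) pair
for (wf_id in all_workflows$wflow_id) {
  wf <- extract_workflow(all_workflows, id = wf_id)
  for (j in seq_len(nrow(concrete_folds))) {
    fold_id  <- paste(concrete_folds$id[j], concrete_folds$id2[j], sep = "_")
    out_file <- paste0(wf_id, "_", fold_id, ".rds")
    if (file.exists(out_file)) next    # resume point after a crash
    one_fold <- manual_rset(concrete_folds$splits[j], fold_id)
    set.seed(1503)  # naive attempt to get the same grid on every fold; not sure this is sound
    res <- tune_grid(wf, resamples = one_fold, grid = 25, control = grid_ctrl)
    saveRDS(res, out_file)
  }
}
What I can't figure out is whether per-fold pieces like these can be read back and stitched into something that behaves like the output of workflow_map() (or at least something that collect_metrics() and rank_results() will accept), which is really the heart of my question.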