I am using tidymodels to build a workflow set (comprising multiple recipes and models), tune it, and then stack it. After calling blend_predictions() on the stack (and/or after fitting it), I can pass it to autoplot() with the argument type = "weights" to produce a nice bar chart of the selected models and their coefficients. How can I get the selected models and stacking coefficients in a structured, orderly format (e.g., a tibble)? Calling collect_parameters() throws an error no matter what I pass to it (the initial data stack after add_candidates(), the model stack after blend_predictions(), or the fitted stack after fit_members()). I am including a reprex below.
First, the setup:
# Load necessary libraries
library(tidymodels)
library(stacks)
library(dplyr)
library(future)
library(modeldata)
# Configure parallel processing (multisession workers) and set seed
plan(multisession)
set.seed(123)
# Load data
data(attrition)
# Split data into training and testing sets
data_split <- initial_split(attrition, prop = 0.75, strata = Attrition)
data_train <- training(data_split)
data_test <- testing(data_split)
Then the recipes and models:
# Create recipes
base_recipe <- recipe(Attrition ~ ., data = data_train) %>%
  step_zv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
pls_recipe <- base_recipe %>%
  step_pls(all_numeric_predictors(), outcome = vars(Attrition), num_comp = ncol(data_train) - 1)
# Specify models with tunable parameters
rf_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
log_reg_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
Now combine these into a workflow set and tune it:
# Create a workflow set
workflow_set <- workflow_set(
  preproc = list(base = base_recipe, pls = pls_recipe),
  models = list(rf = rf_spec, xgb = xgb_spec, log_reg = log_reg_spec),
  cross = TRUE
)
# Tune the workflow set with cross-validation and a tuning grid
set.seed(123)
cv_folds <- vfold_cv(data_train, v = 5)
tune_results <- workflow_map(
  object = workflow_set,
  fn = "tune_grid",
  verbose = TRUE,
  seed = 123,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy, f_meas),
  control = control_grid(
    verbose = FALSE,
    allow_par = TRUE,
    save_pred = TRUE,
    save_workflow = TRUE,
    parallel_over = "everything"
  )
)
At this point we could do some other stuff to plot the tuning results, extract the best workflows and use them to individually predict on new data, etc. But let's jump to the stack, which can work from the tuned workflowset:
# stack models
stack_models <- stacks() %>%
  add_candidates(tune_results)
# blend predictions
stack_blended <- stack_models %>%
  blend_predictions()
# Fit the stacked model
stack_fitted <- stack_blended %>%
  fit_members()
# view the selected models and coefficients as a chart
autoplot(stack_blended, type = "weights") # could also use stack_fitted
# collect the coefficients - FAILS IN ALL CASES
stack_coefficients <- stack_models %>%
  collect_parameters(candidates = tune_results)
stack_coefficients <- stack_blended %>%
  collect_parameters(candidates = tune_results)
stack_coefficients <- stack_fitted %>%
  collect_parameters(candidates = tune_results)
The error I always receive is:
Error in if ((!inherits(candidates, "character")) | (!candidates %in% :
the condition has length > 1
Sometimes the call runs for a long time before finally erroring out.
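Reading the error message, the failing check seems to want `candidates` to be a character value, and it breaks when the condition evaluates to more than one TRUE/FALSE. Here is a minimal base R sketch of that behaviour, assuming R 4.2 or later (where a length > 1 condition in `if` is an error). `check_candidates()` and `known_names` are hypothetical stand-ins that only mimic the condition quoted in the error. They are not the actual stacks source.

```r
# Hypothetical candidate names; in the reprex these would presumably be
# the wflow_id values of the workflow set (an assumption).
known_names <- c("base_rf", "base_xgb", "base_log_reg")

# Mimics the condition shown in the error message (not the stacks code).
check_candidates <- function(candidates, known) {
  if ((!inherits(candidates, "character")) | (!candidates %in% known)) {
    stop("`candidates` must be the name of one candidate set")
  }
  invisible(TRUE)
}

# A single known name yields a length-one condition and passes.
ok <- check_candidates("base_rf", known_names)

# A list-like object (a tibble is also a list of columns) yields a
# condition with one element per list entry, which `if` rejects.
not_a_name <- list("base_rf", "base_xgb")
msg <- tryCatch(
  check_candidates(not_a_name, known_names),
  error = function(e) conditionMessage(e)
)
print(msg)
```

If that reading is right, passing a single candidate name as a string, e.g. `collect_parameters(stack_blended, candidates = "base_rf")`, may be what the function expects. I have not confirmed that the names match `tune_results$wflow_id`.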
I am using the latest version of R (4.4.0) and RStudio (2024.04.2+764, the version that came out yesterday—although this issue was happening before that), and all packages are up to date (tidymodels: 1.2.0, stacks: 1.0.4, dplyr: 1.1.4, future: 1.33.2, modeldata: 1.3.0).
Any suggestions on how to resolve this issue and/or get an "orderly" list of the stack model coefficients?