Getting stack candidate coefficients and tuning parameters with collect_parameters {stacks}

lyndess · June 11, 2024, 11:55pm

I am using tidymodels to build a workflowset (comprising multiple recipes and models), tune the workflowset, and then stack it. After calling blend_predictions on the stack (and/or after fitting it), we can pass it to autoplot with the argument type="weights" to produce a nice bar job of the selected models and their coefficients. How can I get the selected models and stacking coefficients in a structured, orderly format (e.g., a tibble)? Calling collect_parameters() seems to throw an error no matter what I pass to it (the initial data stack after add_candidates(), the model stack after blend_predictions(), or the fitted stack after fit_members()). I am including a reprex below.

First, the setup:

# Load necessary libraries
library(tidymodels)
library(stacks)
library(dplyr)
library(future)
library(modeldata)

# Configure multicore processing and set seed
plan(multisession)
set.seed(123)

# Load data
data(attrition)

# Split data into training and testing sets
data_split <- initial_split(attrition, prop = 0.75, strata = Attrition)
data_train <- training(data_split)
data_test <- testing(data_split)

Then the recipes and models:

# Create recipes
base_recipe <- recipe(Attrition ~ ., data = data_train) %>%
  step_zv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

pls_recipe <- base_recipe %>%
  step_pls(all_numeric_predictors(), outcome = vars(Attrition), num_comp = ncol(data_train) - 1)

# Specify models with tunable parameters
rf_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

xgb_spec <- boost_tree(trees = 1000, tree_depth = tune(), learn_rate = tune(), loss_reduction = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

log_reg_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

Now combine these into a workflow set and tune it:

# Create a workflow set
workflow_set <- workflow_set(
  preproc = list(base = base_recipe, pls = pls_recipe),
  models = list(rf = rf_spec, xgb = xgb_spec, log_reg = log_reg_spec),
  cross = TRUE
)

# Tune the workflow set with cross-validation and a tuning grid
set.seed(123)
cv_folds <- vfold_cv(data_train, v = 5)

tune_results <- workflow_map(
  object = workflow_set,
  fn = "tune_grid",
  verbose = TRUE,
  seed = 123,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy, f_meas),
  control = control_grid(verbose = FALSE, allow_par = TRUE, save_pred = TRUE, save_workflow = TRUE, parallel_over = "everything")
)

At this point we could do some other stuff to plot the tuning results, extract the best workflows and use them to individually predict on new data, etc. But let's jump to the stack, which can work from the tuned workflowset:

# stack models
stack_models <- stacks() %>%
  add_candidates(tune_results)

# blend predictions
stack_blended <- stack_models %>% 
  blend_predictions()

# Fit the stacked model
stack_fitted <- stack_blended %>%
  fit_members()

# view the selected models and coefficients as a chart
autoplot(stack_blended, type="weights")  # could also use stack_fitted

# collect the coefficients - FAILS IN ALL CASES
stack_coefficients <- stack_models %>% 
  collect_parameters(candidates = tune_results)

stack_coefficients <- stack_blended %>% 
  collect_parameters(candidates = tune_results)

stack_coefficients <- stack_fitted %>% 
  collect_parameters(candidates = tune_results)

The error I always receive is:

Error in if ((!inherits(candidates, "character")) | (!candidates %in%  : 
 the condition has length > 1

Sometimes the process runs for an extensive time period before finally erroring out.

I am using the latest version of R (4.4.0) and RStudio (2024.04.2+764, the version that came out yesterday—although this issue was happening before that), and all packages are up to date (tidymodels: 1.2.0, stacks: 1.0.4, dplyr: 1.1.4, future: 1.33.2, modeldata: 1.3.0).

Any suggestions on how to resolve this issue and/or get an "orderly" list of the stack model coefficients?

simonpcouch · June 12, 2024, 2:16am

Thanks for the Community post! I've just pushed a fix so that that error is more informative in the future. The issue here is that "candidates" is the name of a set of tuning results rather than the tuning results themselves (see the argument documentation. So, e.g. passing a candidates argument like:

stack_models %>% 
  collect_parameters(candidates = "base_rf")

...will do the trick.

lyndess · June 12, 2024, 3:31am

And another answer just popped into my head!

Assign a name to the ggplot object created with autoplot:

plot <- autoplot(stack_blended, type="weights")

... and then extract the data from it:

plot$data

#> # A tibble: 11 × 4
#>    terms                       model        weight member
#>    <chr>                       <chr>         <dbl>  <int>
#>  1 .pred_Yes_pls_xgb_1_03      boost_tree    0.160      1
#>  2 .pred_Yes_base_log_reg_1_02 logistic_reg  0.233      2
#>  3 .pred_Yes_pls_xgb_1_09      boost_tree    0.360      3
#>  4 .pred_Yes_base_log_reg_1_06 logistic_reg  0.439      4
#>  5 .pred_Yes_base_xgb_1_06     boost_tree    0.602      5
#>  6 .pred_Yes_base_xgb_1_08     boost_tree    0.602      6
#>  7 .pred_Yes_pls_xgb_1_06      boost_tree    0.619      7
#>  8 .pred_Yes_pls_xgb_1_08      boost_tree    0.860      8
#>  9 .pred_Yes_base_log_reg_1_03 logistic_reg  0.983      9
#> 10 .pred_Yes_base_rf_1_01      rand_forest   1.03      10
#> 11 .pred_Yes_base_log_reg_1_08 logistic_reg  1.13      11

And upon further investigation, I now see that calls to collect_parameters()must be made for each individual candidate, because the output includes the tuning parameters—which are potentially different from each candidate (if they were constructed from different models):

stack_blended %>% collect_parameters(candidates = tune_results$wflow_id[[1]])

#> # A tibble: 10 × 5
#>    member        mtry min_n terms                   coef
#>    <chr>        <int> <int> <chr>                  <dbl>
#>  1 base_rf_1_01    15    30 .pred_Yes_base_rf_1_01  1.03
#>  2 base_rf_1_02    53     3 .pred_Yes_base_rf_1_02  0   
#>  3 base_rf_1_03     8    39 .pred_Yes_base_rf_1_03  0   
#>  4 base_rf_1_04    24    27 .pred_Yes_base_rf_1_04  0   
#>  5 base_rf_1_05    52    36 .pred_Yes_base_rf_1_05  0   
#>  6 base_rf_1_06    22    20 .pred_Yes_base_rf_1_06  0   
#>  7 base_rf_1_07    30    24 .pred_Yes_base_rf_1_07  0   
#>  8 base_rf_1_08    46    12 .pred_Yes_base_rf_1_08  0   
#>  9 base_rf_1_09    38    15 .pred_Yes_base_rf_1_09  0   
#> 10 base_rf_1_10     5     7 .pred_Yes_base_rf_1_10  0  

stack_blended %>% collect_parameters(candidates = tune_results$wflow_id[[2]])

#> # A tibble: 10 × 6
#>    member        tree_depth learn_rate loss_reduction terms                    coef
#>    <chr>              <int>      <dbl>          <dbl> <chr>                   <dbl>
#>  1 base_xgb_1_01          6    0.00122  0.0318        .pred_Yes_base_xgb_1_01 0    
#>  2 base_xgb_1_02         11    0.00382  0.0000000286  .pred_Yes_base_xgb_1_02 0    
#>  3 base_xgb_1_03          8    0.148    1.37          .pred_Yes_base_xgb_1_03 0    
#>  4 base_xgb_1_04          9    0.0254   3.59          .pred_Yes_base_xgb_1_04 0    
#>  5 base_xgb_1_05         15    0.00925  0.00000000124 .pred_Yes_base_xgb_1_05 0    
#>  6 base_xgb_1_06          2    0.0516   0.0000214     .pred_Yes_base_xgb_1_06 0.602
#>  7 base_xgb_1_07         12    0.0154   0.000820      .pred_Yes_base_xgb_1_07 0    
#>  8 base_xgb_1_08          5    0.00254  0.00000000316 .pred_Yes_base_xgb_1_08 0.602
#>  9 base_xgb_1_09          7    0.240    0.000266      .pred_Yes_base_xgb_1_09 0    
#> 10 base_xgb_1_10          3    0.0632   0.00000139    .pred_Yes_base_xgb_1_10 0

Thank you for the guidance, Simon!

lyndess · June 12, 2024, 5:08am

One more update so that others can benefit ...

To get all of the coefficients and tuning parameters into a single object, one approach would be to map over the blended stack and collect parameters from each candidate, and then—because different workflows will have different parameters—pivot the tibble wider before binding them all together:

# Use map with an implicit function to collect and pivot parameters, and then combine the results
stack_params_long <- names(stack_blended$model_defs) %>%
  map(function(candidate) {
    stack_blended %>%
      collect_parameters(candidates = candidate) %>%
      pivot_longer(cols = -c(member, terms, coef), names_to = "parameter", values_to = "value") %>%
      mutate(model = candidate, .before = 1)
  }) %>%
  list_rbind()

# Display the combined tibble
stack_params_long

#> # A tibble: 140 × 6
#>    model   member       terms                   coef parameter value
#>    <chr>   <chr>        <chr>                  <dbl> <chr>     <dbl>
#>  1 base_rf base_rf_1_01 .pred_Yes_base_rf_1_01  1.03 mtry         15
#>  2 base_rf base_rf_1_01 .pred_Yes_base_rf_1_01  1.03 min_n        30
#>  3 base_rf base_rf_1_02 .pred_Yes_base_rf_1_02  0    mtry         53
#>  4 base_rf base_rf_1_02 .pred_Yes_base_rf_1_02  0    min_n         3
#>  5 base_rf base_rf_1_03 .pred_Yes_base_rf_1_03  0    mtry          8
#>  6 base_rf base_rf_1_03 .pred_Yes_base_rf_1_03  0    min_n        39
#>  7 base_rf base_rf_1_04 .pred_Yes_base_rf_1_04  0    mtry         24
#>  8 base_rf base_rf_1_04 .pred_Yes_base_rf_1_04  0    min_n        27
#>  9 base_rf base_rf_1_05 .pred_Yes_base_rf_1_05  0    mtry         52
#> 10 base_rf base_rf_1_05 .pred_Yes_base_rf_1_05  0    min_n        36
#> # ℹ 130 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Of course, one problem with this might be if there are non-numeric tuning parameters, since pivot_longer results in a numeric "value" column,

Another approach would be to create two objects: first, create a list that simply aggregates all of the results of calling collect_parameters:

# Create a list where each item is the result of calling `collect_parameters`
stack_params <- names(stack_blended$model_defs) %>%
  map(~ stack_blended %>% collect_parameters(candidates = .x)) %>%
  set_names(names(stack_blended$model_defs))

Retrieving elements from this list might be easier than multiple calls to collect_parameters.

Once we have this list, we can "flatten it" to the same type of long-format tibble that we created above:

# combine and pivot the results into a long format tibble
stack_params_long <- stack_params %>%
  imap(~ .x %>%
         pivot_longer(cols = -c(member, terms, coef), names_to = "parameter", values_to = "value") %>%
         mutate(wflow_id = .y, .before = 1)) %>%
  list_rbind()

# Display the combined tibble
stack_params_long

#> # A tibble: 140 × 6
#>    model   member       terms                   coef parameter value
#>    <chr>   <chr>        <chr>                  <dbl> <chr>     <dbl>
#>  1 base_rf base_rf_1_01 .pred_Yes_base_rf_1_01  1.03 mtry         15
#>  2 base_rf base_rf_1_01 .pred_Yes_base_rf_1_01  1.03 min_n        30
#>  3 base_rf base_rf_1_02 .pred_Yes_base_rf_1_02  0    mtry         53
#>  4 base_rf base_rf_1_02 .pred_Yes_base_rf_1_02  0    min_n         3
#>  5 base_rf base_rf_1_03 .pred_Yes_base_rf_1_03  0    mtry          8
#>  6 base_rf base_rf_1_03 .pred_Yes_base_rf_1_03  0    min_n        39
#>  7 base_rf base_rf_1_04 .pred_Yes_base_rf_1_04  0    mtry         24
#>  8 base_rf base_rf_1_04 .pred_Yes_base_rf_1_04  0    min_n        27
#>  9 base_rf base_rf_1_05 .pred_Yes_base_rf_1_05  0    mtry         52
#> 10 base_rf base_rf_1_05 .pred_Yes_base_rf_1_05  0    min_n        36
#> # ℹ 130 more rows
#> # ℹ Use `print(n = ...)` to see more rows

system · June 19, 2024, 5:09am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.