Problems trying to use workflow_set with multiple tunable models and recipes

Hi,

I'm trying to learn how to use multiple models with tune and workflow_set.

When I apply workflow_map to my workflow_set using my CV resamples, I get no error; however, when I inspect the grid results, I see the result column contains 'try-errr'.

I'm trying to follow along with this example in the TMwR book using the Ames dataset; my code is as follows:

pacman::p_load(tidymodels,tidyverse,doParallel,janitor,AmesHousing,vip)

set.seed(1234)

# load the housing data and clean names
ames_data <- make_ames() %>%
  janitor::clean_names()

# split into training and testing datasets, stratified by sale_price
ames_split <- rsample::initial_split(
  ames_data, 
  prop = 0.8, 
  strata = sale_price
)

# CREATE TRAINING AND TESTING OBJECTS FROM THE SPLIT OBJECT
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

# CREATE RESAMPLES TO CHOOSE AND COMPARE MODELS
set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = sale_price)


# EDA ---------------------------------------------------------------------

# DEFINE PREPROCESSING RECIPES --------------------------------------------

base_rec <-
  recipe(sale_price ~ ., data = ames_train) %>%
  step_log(sale_price, base = 10) %>%
  step_YeoJohnson(lot_area, gr_liv_area) %>%
  step_other(neighborhood, threshold = .1)  %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_ns(longitude, deg_free = tune("long df")) %>%
  step_ns(latitude, deg_free = tune("lat df"))

# REDUCE THE CARDINALITY OF SOME OF THE CATEGORICAL VARIABLES
low_cardinality_recipe <- base_rec %>% 
  # convert categorical variables to factors
  recipes::step_string2factor(all_nominal()) %>%
  # combine low frequency factor levels
  recipes::step_other(all_nominal(), other = "Other", threshold = 0.01) %>%
  # remove near-zero variance predictors which provide no predictive information
  recipes::step_nzv(all_nominal())

# CREATE A PRINCIPAL COMPONENT ANALYSIS RECIPE
pca_recipe <- recipe(sale_price ~ ., data = ames_train) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), num_comp = tune())

# # UPDATE THE GENERIC PARAMETERS FOR DEGREES OF FREEDOM SINCE IT HAS A SMALL RANGE
# # SPLINE_DEGREE() HAS MORE APPROPRIATE VALUES FOR SPLINES
# ames_param <- 
#   base_rec %>% 
#   parameters() %>% 
#   update(
#     `long df` = spline_degree(), 
#     `lat df` = spline_degree()
#   )

# BUILD A MODEL -----------------------------------------------------------

# DEFINE AN XGBOOST MODEL
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(), 
  min_n = tune(),
  loss_reduction = tune(),                     
  sample_size = tune(), 
  mtry = tune(),           
  learn_rate = tune(),
  engine = "xgboost",
  mode = "regression"
)

# DEFINE A RANDOM FOREST MODEL
rf_spec <- rand_forest(
  trees = 500,
  mtry = tune(),
  min_n = tune(),
  engine = "ranger",
  mode = "regression"
)

# DEFINE A WORKFLOW SET ---------------------------------------------------
ames_set <- workflowsets::workflow_set(
  preproc = list(base_rec, low_cardinality_recipe, pca_recipe),
  models = list(xgb_spec, rf_spec)
)

ames_set

# SET UP PARALLEL PROCESSING
doParallel::registerDoParallel(detectCores())

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

grid_results <- ames_set %>% 
  workflow_map(
    seed = 1234,
    resamples = ames_folds,
    grid = 30,
    control = grid_ctrl
  )

grid_results

While researching the error, I came across this post, but I don't believe I've enabled logging while using parallel processing.

In the image below, notice the try-errr message:

Any help you can provide would be greatly appreciated.

If you take a look at what error occurred by printing grid_results$result[[1]] to the console, you get:

! Some model parameters require finalization but there are recipe parameters that require tuning. Please use `extract_parameter_set_dials()` to set parameter ranges manually and supply the output to the `param_info` argument.

You are trying to tune mtry, for which the upper limit of possible values depends on the number of predictors available. Since that is not known a priori, the parameter object for it has a value of `unknown()` by default. Once the data is available, we can fill in that upper limit, which means we "finalize" the parameter object. But here you are also trying to tune over the preprocessing (with the recipe), so we don't really know what the upper limit should be based on the data. Therefore, the error message asks you to provide the range of possible values for mtry manually. You can do that as follows, and then pass the result to tune_grid() via the `param_info` argument, as suggested in the error message:

library(tidymodels)

rf_spec <- rand_forest(
  trees = 500,
  mtry = tune(),
  min_n = tune(),
  engine = "ranger",
  mode = "regression"
)

parameter_set_with_unknowns <- extract_parameter_set_dials(rf_spec)
parameter_set_with_unknowns
#> Collection of 2 parameters for tuning
#> 
#>  identifier  type    object
#>        mtry  mtry nparam[?]
#>       min_n min_n nparam[+]
#> 
#> Model parameters needing finalization:
#>    # Randomly Selected Predictors ('mtry')
#> 
#> See `?dials::finalize` or `?dials::update.parameters` for more information.

# use a range appropriate for your problem instead of `c(1, 10)`
finalized_parameter_set <-  parameter_set_with_unknowns %>% 
  update(mtry = mtry(range = c(1, 10)))

finalized_parameter_set
#> Collection of 2 parameters for tuning
#> 
#>  identifier  type    object
#>        mtry  mtry nparam[+]
#>       min_n min_n nparam[+]

Created on 2023-11-02 by the reprex package (v2.0.1)
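
To do this within a workflow set rather than a single tune_grid() call, a minimal sketch (assuming the objects from your original post; the workflow ID below is a guess at the auto-generated one, so check the id column of ames_set for the exact value) would be to extract the parameter set from the workflow itself, since that picks up the recipe parameters as well as the model parameters, finalize mtry, and then attach the result with workflowsets::option_add():

# pull the parameter set from one workflow in the set: it contains
# both the recipe parameters and the model parameters
rf_params <- ames_set %>%
  extract_workflow(id = "recipe_2_rand_forest") %>%  # ID assumed, check ames_set
  extract_parameter_set_dials() %>%
  update(mtry = mtry(range = c(1, 10)))  # use a range suited to your data

# attach it to that workflow, then tune as before
grid_results <- ames_set %>%
  option_add(param_info = rf_params, id = "recipe_2_rand_forest") %>%
  workflow_map(
    seed = 1234,
    resamples = ames_folds,
    grid = 30,
    control = grid_ctrl
  )

Note that any workflow you supply a param_info option for needs that parameter set to cover all of its tuning parameters, both from the recipe and from the model.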

Hi Hannah,

Thanks for the reply.

I may be misunderstanding you, but I've got tuning parameters in two of my three recipes, base_rec (splines) and pca_recipe (num_comp), as well as in both models (xgb and rf). For the two recipes I've extracted the parameters and updated the values; I've done the same for both models and supplied these via workflowsets::option_add(). However, now all my models are failing to fit.

Have I captured the updated parameters correctly? The full code is below:

pacman::p_load(tidymodels, tidyverse, doParallel, janitor, AmesHousing, vip, randomForest)

set.seed(1234)

# load the housing data and clean names
ames_data <- make_ames() %>%
  janitor::clean_names()

# split into training and testing datasets, stratified by sale_price
ames_split <- rsample::initial_split(
  ames_data,
  prop = 0.8,
  strata = sale_price
)

# CREATE TRAINING AND TESTING OBJECTS FROM THE SPLIT OBJECT
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

# CREATE RESAMPLES TO CHOOSE AND COMPARE MODELS
set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = sale_price)

# EDA ---------------------------------------------------------------------

# DEFINE PREPROCESSING RECIPES --------------------------------------------

base_rec <-
  recipe(sale_price ~ ., data = ames_train) %>%
  step_log(sale_price, base = 10) %>%
  step_YeoJohnson(lot_area, gr_liv_area) %>%
  step_other(neighborhood, threshold = .1) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_ns(longitude, deg_free = tune("long df")) %>%
  step_ns(latitude, deg_free = tune("lat df"))

# PROVIDE RANGES OF VALUES FOR THE LAT/LONG SPLINE TERMS
spline_param <-
  base_rec %>%
  extract_parameter_set_dials() %>%
  update(
    "long df" = spline_degree(),
    "lat df" = spline_degree()
  )

# REDUCE THE CARDINALITY OF SOME OF THE CATEGORICAL VARIABLES
low_cardinality_recipe <- base_rec %>%
  # convert categorical variables to factors
  recipes::step_string2factor(all_nominal()) %>%
  # combine low frequency factor levels
  recipes::step_other(all_nominal(), other = "Other", threshold = 0.01) %>%
  # remove near-zero variance predictors which provide no predictive information
  recipes::step_nzv(all_nominal())

# CREATE A PRINCIPAL COMPONENT ANALYSIS RECIPE
pca_recipe <- recipe(sale_price ~ ., data = ames_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

# MANUALLY PROVIDE A RANGE FOR THE NUMBER OF COMPONENTS
pca_param <-
  pca_recipe %>%
  extract_parameter_set_dials() %>%
  update(num_comp = num_comp(c(0, 20)))

# BUILD A MODEL -----------------------------------------------------------

# DEFINE AN XGBOOST MODEL
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost", importance = TRUE) %>%
  set_mode("regression")

xgb_param <-
  extract_parameter_set_dials(xgb_spec)

xgb_param

new_xgb_param <- xgb_param %>% 
  update(mtry = mtry(range = c(1, 80)))

# DEFINE A RANDOM FOREST MODEL
rf_spec <- rand_forest(
  trees = 500L,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("regression")

rf_param <-
  extract_parameter_set_dials(rf_spec)

rf_param
  
new_rf_param <- rf_param %>% 
  update(mtry = mtry(range = c(1, 80)))


# DEFINE A WORKFLOW SET ---------------------------------------------------

wf_set <- workflowsets::workflow_set(
  preproc = list(
    base = base_rec,
    step_other = low_cardinality_recipe,
    pca = pca_recipe
  ),
  models = list(
    xgboost = xgb_spec,
    rf = rf_spec
  ),
  cross = TRUE
)

wf_set

wf_set_new <- wf_set %>%
  #workflowsets::option_add_parameters() %>%
  workflowsets::option_add(param_info = spline_param, id = "base_xgboost") %>%
  workflowsets::option_add(param_info = spline_param, id = "base_rf") %>%
  workflowsets::option_add(param_info = pca_param, id = "pca_xgboost") %>%
  workflowsets::option_add(param_info = pca_param, id = "pca_rf") %>% 
  workflowsets::option_add(param_info = new_rf_param, id = "base_rf") %>% 
  workflowsets::option_add(param_info = new_rf_param, id = "step_other_rf") %>%
  workflowsets::option_add(param_info = new_rf_param, id = "pca_rf") %>% 
  workflowsets::option_add(param_info = new_xgb_param, id = "base_xgboost") %>% 
  workflowsets::option_add(param_info = new_xgb_param, id = "step_other_xgboost") %>%
  workflowsets::option_add(param_info = new_xgb_param, id = "pca_xgboost")

wf_set_new

# SET UP PARALLEL PROCESSING
doParallel::registerDoParallel(detectCores())

fit_workflows <- wf_set_new %>%
  workflow_map(
    seed = 1234,
    grid = 20,
    resamples = ames_folds,
    verbose = TRUE
  )

fit_workflows$result[[1]] 

# TURN OFF PARALLEL COMPUTE
doParallel::stopImplicitCluster()

fit_workflows

fit_workflows %>% 
  rank_results() %>% 
  filter(.metric == "rmse") %>% 
  select(model, .config, rmse = mean, rank)

Below is the warning I receive:

i 1 of 6 tuning:     base_xgboost
✖ 1 of 6 tuning:     base_xgboost failed with 
i 2 of 6 tuning:     base_rf
✖ 2 of 6 tuning:     base_rf failed with 
i 3 of 6 tuning:     step_other_xgboost
✖ 3 of 6 tuning:     step_other_xgboost failed with 
i 4 of 6 tuning:     step_other_rf
✖ 4 of 6 tuning:     step_other_rf failed with 
i 5 of 6 tuning:     pca_xgboost
✖ 5 of 6 tuning:     pca_xgboost failed with 
i 6 of 6 tuning:     pca_rf
✖ 6 of 6 tuning:     pca_rf failed with 
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: All models failed. Run `show_notes(.Last.tune.result)` for more information.
2: Unknown or uninitialised column: `.notes`.
3: All models failed. Run `show_notes(.Last.tune.result)` for more information.
4: Unknown or uninitialised column: `.notes`.
5: All models failed. Run `show_notes(.Last.tune.result)` for more information.
6: Unknown or uninitialised column: `.notes`.
7: All models failed. Run `show_notes(.Last.tune.result)` for more information.
8: Unknown or uninitialised column: `.notes`.
9: All models failed. Run `show_notes(.Last.tune.result)` for more information.
10: Unknown or uninitialised column: `.notes`.
11: All models failed. Run `show_notes(.Last.tune.result)` for more information.
12: Unknown or uninitialised column: `.notes`.

Using show_notes(.Last.tune.result) provides the following info:

show_notes(.Last.tune.result)
unique notes:
────────────────────────────────────────────────────────
Error in `iter_grid_info[1L, preprocessor_param_names]`:
! Can't subset columns that don't exist.
✖ Column `num_comp` doesn't exist.

Any direction you can provide would be greatly appreciated.

Thanks

There is a lot going on in your example! Naturally so, if you want to try out workflow sets 🙂 Could you try to reduce the complexity, though?

Some steps that could help here are

  • to turn off parallel processing
  • to check the individual workflows before putting them in a workflow set (see the sketch below)
  • to slim down the workflows, in terms of tuning parameters and preprocessing steps

That way you can narrow down what the problem is. It may be a bug, but often it's the combination of data characteristics and model/preprocessing specifications.
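
For example, here is a minimal sketch of checking a single workflow on its own, outside the set and without parallel processing (assuming the objects from your last post; adjust the ID and the mtry range to your data):

# extract one workflow from the set and tune it in isolation so that
# any error surfaces directly
pca_rf_wf <- extract_workflow(wf_set_new, id = "pca_rf")

# the parameter set for this workflow has to cover both the recipe
# parameter (num_comp) and the model parameters (mtry, min_n)
pca_rf_params <- pca_rf_wf %>%
  extract_parameter_set_dials() %>%
  update(
    num_comp = num_comp(c(1, 20)),
    mtry = mtry(range = c(1, 20))  # pick an upper bound suited to your data
  )

set.seed(1234)
pca_rf_res <- tune_grid(
  pca_rf_wf,
  resamples = ames_folds,
  grid = 5,
  param_info = pca_rf_params
)

If that runs cleanly, move on to the next workflow; if it errors, the message will point at that specific recipe/model combination.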
