I posted this same question on Stack Overflow as "tidymodels - Why does tune_grid find character variables instead of factors?".
I'm likely missing something pretty obvious, but I'm new to recipes and I don't understand how to use it.
====
Below is a self-contained code example.
- I have test_data with character columns (name, id, gender)
- I convert them all to factors
- I mark name and id as "informational" (i.e. not to be used in model building)
- When I run tune_grid, it complains about name, id, and gender being character, even though two of them should be ignored and all three are factors.
I want to keep all the columns around for debugging, so I don't want to simply drop them.
I also want to separate the data processing (with recipes) and the model training. I'll have lots of different preprocessing recipes, but I'll be building the same sort of model over and over.
Why is this happening?
Error:
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
Debug info during the run:
> final_model <- train_lasso_model(recipe_obj, processed_data)
vfold [2 × 2] (S3: vfold_cv/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int 3
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold1"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int [1:2] 1 2
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold2"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
$ id : chr [1:2] "Fold1" "Fold2"
- attr(*, "v")= num 2
- attr(*, "repeats")= num 1
- attr(*, "breaks")= num 4
- attr(*, "pool")= num 0.1
- attr(*, "fingerprint")= chr "2c80c86a0361fcf4a6d480eb1b0b8d79"
before tune_grid
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
There were issues with some computations A: x2
after tune_grid
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
All models failed. Run `show_notes(.Last.tune.result)` for more information.
> rlang::last_trace()
<error/rlang_error>
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
---
Backtrace:
▆
1. ├─global train_lasso_model(recipe_obj, processed_data)
2. │ └─tune_results %>% select_best(metric = "roc_auc")
3. ├─tune::select_best(., metric = "roc_auc")
4. └─tune:::select_best.tune_results(., metric = "roc_auc")
5. ├─tune::show_best(...)
6. └─tune:::show_best.tune_results(...)
7. └─tune::.filter_perf_metrics(x, metric, eval_time)
8. └─tune::estimate_tune_results(x)
> train <- prepped_recipe %>% juice
> sapply(train[, info_vars], class)
name id gender
"factor" "factor" "factor"
> sapply(processed_data[, info_vars], class)
name id gender
"factor" "factor" "factor"
> class(processed_data)
[1] "tbl_df" "tbl" "data.frame"
> packageVersion("tune")
[1] ‘1.2.1’
Code:
library(recipes)
library(workflows)
library(parsnip)  # logistic_reg(), set_engine()
library(rsample)  # vfold_cv()
library(tune)     # tune(), tune_grid(), select_best(), finalize_workflow()
train_lasso_model <- function(recipe_obj, processed_data,
                              grid_size = 10, folds = 2) {
  # Create a logistic regression model specification with Lasso regularization
  log_reg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
    set_engine("glmnet")

  # Create a workflow
  workflow_obj <- workflow() %>%
    add_recipe(recipe_obj) %>%
    add_model(log_reg_spec)

  # Set up cross-validation
  cv_folds <- vfold_cv(processed_data, v = folds)
  str(cv_folds)

  # Tune the model to find the best regularization strength (penalty)
  message("before tune_grid")
  tune_results <- workflow_obj %>%
    tune_grid(resamples = cv_folds, grid = grid_size)
  message("after tune_grid")

  # Check the best tuning parameters (lambda)
  best_lambda <- tune_results %>%
    select_best(metric = "roc_auc")

  # Finalize the workflow with the best penalty
  message("before finalize_workflow")
  final_workflow <- workflow_obj %>%
    finalize_workflow(best_lambda)
  message("after finalize_workflow")

  # Fit the final model and return it
  final_model <- fit(final_workflow, data = processed_data)
  return(final_model)
}
test_data <- data.frame(
  name = c("sam", "sam", "sam"),
  id = c("1", "2", "3"),
  gender = c("male", "female", "female"),
  target = c(4, 5, 6)
)

info_vars <- c("name", "id",
               # mark gender as informational, but still make it a dummy var
               "gender")

recipe_obj <- recipe(target ~ ., data = test_data) %>%
  # mark vars as not used in the model
  update_role(
    all_of(info_vars),
    new_role = "informational") %>%
  # Create an "unknown" category for all unknown factor levels
  step_unknown(all_nominal(), skip = TRUE) %>%
  # Convert factor/character columns to dummies
  step_dummy(all_nominal(), -all_outcomes(), -all_of(info_vars),
             gender,
             keep_original_cols = TRUE)

prepped_recipe <- recipe_obj %>% prep(training = test_data)
processed_data <- prepped_recipe %>% bake(new_data = NULL)
final_model <- train_lasso_model(recipe_obj, processed_data)
train <- prepped_recipe %>% juice
sapply(train[, info_vars], class)
sapply(processed_data[, info_vars], class)
class(processed_data)
packageVersion("tune")
NOTE: I get the same error if I comment out step_unknown and step_dummy.
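In case it helps, here is the extra debugging sketch I've been using (not part of the failing run above; it assumes the `prepped_recipe` and `info_vars` objects from the code) to inspect the roles and types the recipe recorded:

```r
# Inspect what the prepped recipe recorded for each variable.
# summary() on a recipe returns one row per variable with its type,
# role, and source; the informational columns show up here with role
# "informational" rather than "predictor".
summary(prepped_recipe)

# Compare against the classes in the prepped training data directly:
sapply(juice(prepped_recipe)[, info_vars], class)
```

Both report the informational columns as factors, which is why the "must have class <factor>, not <character>" message confuses me.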