Cross validation: understanding the process and implementation

Hi all,

I am trying to make sure I understand the process of doing cross validation correctly and what goes on under the hood when running the fit_resamples() function in R. Here are my questions:

  1. Is it correct that once the data have been processed in some way (normalized, standardized, logged, etc.), we want to create the cross-validation folds from the processed data rather than the original data?

  2. I have the following code:

library(tidymodels)

main_data <- mtcars

# Split into training and test sets
set.seed(123)
data.split <- initial_split(main_data, prop=.75)
train_data <- training(data.split)
test_data <- testing(data.split)

# Create a linear model
lm_model <- linear_reg() %>% set_engine("lm")

# Create a recipe from original data
main_recipe <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

# Putting everything into a workflow
main_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(main_recipe)

# Create processed training data
processed_data <- main_recipe %>% prep() %>% juice()

# Create the folds
cv_data <- vfold_cv(processed_data, v = 10)

# Operation 1
resample_1 <- main_workflow %>%
  fit_resamples(cv_data)

# Operation 2
resample_2 <- lm_model %>%
  fit_resamples(main_recipe, cv_data)

# Operation 3
resample_3 <- lm_model %>%
  fit_resamples(mpg ~ ., cv_data)

collect_metrics(resample_1)
collect_metrics(resample_2)
collect_metrics(resample_3)

# Alternatively: creating folds using original, unprocessed training data
cv_data2 <- vfold_cv(train_data, v = 10)

# Operation 4
resample_4 <- lm_model %>% 
  fit_resamples(mpg ~ ., cv_data2)

collect_metrics(resample_4)

So in this code I used the processed data to create the folds (creating the folds from processed_data into an object named cv_data). When I looked at the metrics, the three operations gave me exactly the same numbers. Are any of those three operations valid for evaluating the model's performance using resampling? Is there a way to do it without having to prep and bake (or juice) the original training data first?

In operation 4, just for illustration, I used the original, unprocessed data to create the folds and got different results from the first three.

Thanks!

It's the other way around: give the resampling functions the original data and don't do any processing manually.

Let the tune and workflows functions do the preprocessing. You should almost never have to use prep() or bake() when modeling.

If you don't, the resampling procedure will not produce accurate results, since the variation from the preprocessing (which often involves some sort of estimation) would be ignored. The proper method is to have the resampling procedure repeat the preprocessing within each fold. This is what caret, mlr, mlr3, and others do, and it is what tidymodels does too.

That approach (operation 4, where the folds come from the unprocessed training data) is the correct one.
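Something like this (a minimal sketch reusing the objects from your code; the new object names are just for illustration, and no prep()/juice() is needed):

# Folds come from the unprocessed training data
set.seed(123)
folds <- vfold_cv(train_data, v = 10)

# The workflow already bundles lm_model and main_recipe, so fit_resamples()
# re-estimates the recipe inside every fold for you
resampled_fit <- main_workflow %>%
  fit_resamples(resamples = folds)

collect_metrics(resampled_fit)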


The proper method is to have the resampling procedure repeat the preprocessing within each fold.

Ah, this is the most enlightening thing I've read about this! Essentially, just as you don't want the test set to contaminate your training set, you also don't want information from the assessment set to leak into your analysis set. Is that correct?

Then I have a follow up question about fit_resamples:

First, just to confirm my understanding: when I run either model_spec %>% fit_resamples(recipe, resamples, ...) or workflow %>% fit_resamples(resamples, ...), fit_resamples takes the recipe fed to it (either implicitly in the workflow or explicitly as an argument), re-estimates and applies the preprocessing steps within each fold, and then fits the model on the processed data. Is that right?

Second, when you run model_spec %>% fit_resamples(formula, resamples, ...), since there's no recipe involved, fit_resamples will just do cross-validation on the original data with no preprocessing. Is this correct?

Thanks Max!

Yes.

Yes, also correct.

No, we recreate the model matrix from the formula each time since we can't assume that all the factor levels are present, etc. In other words, we treat all methods of preprocessing equally.
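To make that concrete, here is roughly what happens inside a single fold. This is only a hand-rolled sketch using rsample and recipes functions on the objects from your first post (main_recipe, folds built from train_data), not the actual internals of fit_resamples():

set.seed(123)
folds <- vfold_cv(train_data, v = 10)
split_1 <- folds$splits[[1]]

analysis_set   <- analysis(split_1)     # used to estimate the preprocessing and fit the model
assessment_set <- assessment(split_1)   # held out for computing performance metrics

prepped   <- prep(main_recipe, training = analysis_set)   # steps estimated on the analysis set only
fit_data  <- bake(prepped, new_data = NULL)               # processed analysis set (used for fitting)
eval_data <- bake(prepped, new_data = assessment_set)     # the already-estimated steps applied to the assessment set

The model is fit on fit_data, and predictions on eval_data feed the metrics. A formula gets the same per-fold treatment: the model matrix is rebuilt from the formula inside each resample.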

Thanks Max. Another follow-up if you don't mind:

So I'm trying to use workflow_set() to create a bunch of workflows and workflow_map() to conduct resampling on them. I have the following code:

library(tidymodels)
library(ranger)

# Split into training and test sets
set.seed(123)
data.split <- initial_split(mtcars, prop=.75)
train_data <- training(data.split)
test_data <- testing(data.split)

# Create a linear model
lm_model <- linear_reg() %>% set_engine("lm")

# Create a random forest model
rf_model <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Create recipes
recipe1 <- recipe(mpg ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

recipe2 <- recipe(mpg ~ wt, data = train_data) %>%
  step_dummy(all_nominal_predictors())

recipe3 <- recipe(mpg ~ wt + hp, data = train_data) %>%
  step_dummy(all_nominal_predictors())

# Create lists of models and recipes
model_list <- list(rand_forest = rf_model, lm = lm_model)
recipe_list <- list(all = recipe1, wt_only = recipe2, wt_hp = recipe3)

# Create workflows
workflows_combo <- workflow_set(preproc = recipe_list, models = model_list, cross = TRUE)

# Creating folds using original, unprocessed training data
cv_data <- vfold_cv(train_data, v = 10)

# Resample on each workflow object
resampling_result <- workflows_combo %>%
  workflow_map("fit_resamples", seed = 1101, verbose = TRUE, resamples = cv_data)

collect_metrics(resampling_result) %>% filter(.metric == "rsq")
collect_metrics(resampling_result) %>% filter(.metric == "rmse")

As you can see in this code, I created the folds almost at the end, after I had created the workflows with workflow_set() and before running workflow_map(). As I ran workflow_map(), I got the message "Fold05: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned." I googled the warning message and it took me to a discussion on GitHub between you and others that I wasn't sure I understood.

That said, when I ran the code above with a slight modification, creating the folds right after splitting the data into training and test sets and before doing everything else, I no longer got the warning message.

Would you mind explaining why it matters when the folds are created? And also what is the "correct" procedure that one should follow to get accurate results? Thanks so much!

This warning occurs when the model predicts a single unique value for all of the data points. That means the variance of the predicted values is zero, which crashes the R² calculation. It basically means that the model is awful and can be ignored. There is some documentation about it here.

The model has variation, and different data sets will produce different models. In your case, some data sets produce the problem described above and some do not. It doesn't mean that some resamples are "good" and others are "bad". They are being used to measure model performance and are doing the right thing.
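If it helps, here is a tiny illustration of why a constant prediction produces that warning (toy numbers, not from your resamples):

library(yardstick)
library(tibble)

# A constant prediction vector has zero variance, so the correlation behind
# rsq() is undefined; .estimate comes back as NA along with the warning you saw.
d <- tibble(truth = c(21.0, 22.8, 18.7, 30.4), estimate = rep(20.0, 4))
rsq(d, truth, estimate)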

My advice is to always set the seed right before you use random numbers (e.g., before vfold_cv() or initial_split()).
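In code, that looks something like this (a sketch; the second seed value is arbitrary):

# Seed immediately before each call that draws random numbers, so the folds
# you get don't depend on whatever other code ran in between.
set.seed(123)
data.split <- initial_split(mtcars, prop = 0.75)
train_data <- training(data.split)

set.seed(456)   # arbitrary value, chosen just for this sketch
cv_data <- vfold_cv(train_data, v = 10)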

