Recipes: cross-validation with scaling and centering of features

Hi,

I'm trying to make recipes a part of my workflow and have two questions I was hoping to get help with. One is more practical and the other more theoretical.

Taking the example from the main website, how do I actually see the new dataset after the transformations in my recipe? This is mostly for my peace of mind, so I can explore the result afterwards.

library(recipes)
library(mlbench)

data(Sonar)

# Recipe: predict Class from all other columns, centering and then
# scaling every predictor
sonar_rec <- recipe(Class ~ ., data = Sonar) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

The second question is a bit more theoretical and might stem from my lack of understanding of how model predictors are built.

Let's imagine I have a dataset which I split into a training set and a test set. My understanding of centering, at a very high level, is that we subtract the variable's mean from each of its values to produce the new variable (I know there is a bit more to it :slight_smile:).
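In base R, for a single numeric column, what I mean is something like this (a minimal illustration using one column of the Sonar data loaded above):

# Centering subtracts the mean; scaling then divides by the standard
# deviation, mirroring step_center() followed by step_scale()
x <- Sonar$V1
x_standardized <- (x - mean(x)) / sd(x)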

If we use the entire dataset to do this, it will produce one set of scaled variables. If we then split the data into training and test sets, train the model on the training data, and check its validity on the test set, does this inadvertently leak information to my model, because we used the overall mean when scaling the numeric features?

If we go the other way and apply the scaling and centering at the group level (in this case, a training-group mean and a test-group mean), the group means have the potential to differ. If I take this one step further and use cross-validation, there could be k*2 different means depending on the number of folds (10-fold cross-validation would have ten training sets and ten test sets, each with their own unique means).

I guess my question, after all of that, is: what are the steps I should use, and is the above anything to worry about at all, or am I just over-caffeinated? :slight_smile:

Thank you all for your help

I don't know about the first question, but the second sounds like something I can help you with.

At the link you provided, you can find the following text:

If these are the only preprocessing steps for the predictors, we can now estimate the means and standard deviations from the training set. The prep function is used with a recipe and a data set:

trained_rec <- prep(standardized, training = seg_train)

Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:

train_data <- bake(trained_rec, newdata = seg_train)
test_data  <- bake(trained_rec, newdata = seg_test)

So your intuition about not leaking information from the test set is correct. You estimate the means and standard deviations from the training set and then use them on the test set. This way you can be certain that you are not using information you are not supposed to have.

As for your further question about the potential for the means to be different -- well, that's why you are a data scientist, isn't it? :slight_smile: It is up to you to monitor the data that comes in and make sure there is no funny business happening. There are multiple packages that can help you with this, where you set up expectations about the data you get and check that it conforms to those expectations. But the rule of thumb stays the same: whenever you want unbiased estimates of performance, you must never use any data from the test set.


To echo @mishabalyasin's answer, prep is used to estimate things (like the means) from the data set and bake is used to apply the preprocessing steps to any data set.
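For the recipe in the original post, seeing the transformed data would look something like this (a sketch; note that recent versions of recipes use the argument new_data in bake() instead of newdata):

# Estimate the means and standard deviations from the data
trained_sonar <- prep(sonar_rec, training = Sonar)

# Apply the centering and scaling, then inspect the result
sonar_standardized <- bake(trained_sonar, newdata = Sonar)
head(sonar_standardized)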

In resampling, you would repeatedly estimate means for each resample. This is how resampling procedures estimate the variability of the model's performance. These means, and any associated models, are only used to estimate performance and are discarded after that is done.
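As an illustration of what happens inside a single resample (a sketch using the rsample package, which the posts above don't show):

library(rsample)

# Ten folds, each with an analysis (modeling) set and an
# assessment (holdout) set
folds <- vfold_cv(Sonar, v = 10)

# Within one fold: estimate the means/SDs on the analysis set only,
# then apply them to both sides of the split
split <- folds$splits[[1]]
fold_rec <- prep(sonar_rec, training = analysis(split))
fold_analysis <- bake(fold_rec, newdata = analysis(split))
fold_assessment <- bake(fold_rec, newdata = assessment(split))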

In practice, you would estimate all of your preprocessing using the training set (which is supplied to the recipe and prep functions) and then apply it to all data sets (e.g. training, test, new unknown samples, etc.) via bake. This is really important; no other data should be used to inform the preprocessing or modeling (google "information leakage").

There are examples of using recipes with resampling in this article as well as in the conference workshop notes (see parts 2 and 3). You can do this in caret too, since train can take a recipe as input.
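For instance, something along these lines (a sketch; the resampling setup and model choice are illustrative assumptions, not taken from the linked notes):

library(caret)

# caret preps the recipe on each resample's training portion and
# bakes it onto the holdout, so nothing leaks across folds
ctrl <- trainControl(method = "cv", number = 10)
sonar_fit <- train(sonar_rec, data = Sonar, method = "glm", trControl = ctrl)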


Hi @Max @mishabalyasin

Thank you both for your very clear explanations.