Concerns about How Data Leakage is Managed using Tidymodels' Themis Package

benphua · August 18, 2020, 2:36pm

Hi There Community,

I've got a binary classification problem involving an imbalanced dataset. I would like to find a high-performing model via 5-fold cross-validation plus a grid search, using tune_grid() from the tune package in Tidymodels. I pass a single recipe object as the preprocessor, the folds generated by vfold_cv() as the resamples, and so on.

To help with the imbalance, I would like to apply Synthetic Minority Over-sampling Technique (SMOTE) via the Themis package in tidymodels by including step_smote(Class) in my preprocessing recipe.

However, I am a bit concerned about how tune_grid() manages the execution of SMOTE to prevent data leakage when the resamples argument is a tibble of folds from vfold_cv().
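
For concreteness, the setup described above could be sketched roughly as follows. The simulated data, the predictor names x1 and x2, and the decision tree with the rpart engine are assumptions made purely for illustration; only step_smote(Class), vfold_cv() and tune_grid() come from the question itself.

```r
library(tidymodels)
library(themis)

set.seed(2020)

# Hypothetical imbalanced data: roughly 15% of rows fall in the minority class.
n <- 400
dat <- tibble(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(Class = factor(ifelse(x1 + rnorm(n) > 1.5, "minority", "majority")))

# Preprocessing recipe with SMOTE on the outcome column.
rec <- recipe(Class ~ ., data = dat) %>%
  step_smote(Class)

# Illustrative model with one tunable parameter (an assumption, not from the post).
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# 5-fold cross-validation, stratified on the outcome.
folds <- vfold_cv(dat, v = 5, strata = Class)

# Grid search: the recipe is passed as the preprocessor, the folds as the resamples.
res <- tune_grid(
  tree_spec,
  preprocessor = rec,
  resamples = folds,
  grid = 5,
  metrics = metric_set(roc_auc)
)

collect_metrics(res)
```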

My understanding is that in K-Fold Cross Validation we should ensure that data leakage does not occur even at the fold level.

I also understand that Tidymodels is set up to prevent data leakage (I'm coming from the Scikit learn ecosystem so I'm not familiar with caret and came straight into Tidymodels).

Thus, would it be safe to assume that any dependency-causing data transformation such as SMOTE will automatically be applied independently to the training data, and again independently to the test data, within each fold, so that no data leakage occurs within a fold?

P.S. If it helps this is similar to Caret Documentation's 11.2 Subsampling During Resampling discussion.

Thank you for your time and kind advice,

Ben

benphua · August 18, 2020, 3:50pm

Doh, I think I've solved my own problem. I found the Tidymodels tutorial on this specific topic: https://www.tidymodels.org/learn/models/sub-sampling/

I'll leave this here for anyone else who comes across this issue in their learning journey!

Max · August 18, 2020, 3:52pm

Just like caret, themis/recipes will:

- only apply the extra sampling routine to the data used for modeling (aka the analysis set in our lingo);
- not apply the sampling routine directly to the corresponding data used for prediction (the assessment set).

It is safe to use these sub-sampling recipe steps within tune_*() functions.
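
One way to see this behaviour for a single fold is to prep the recipe on the analysis set by hand and then bake both sets. This is only a sketch under stated assumptions: the simulated data and the names x1 and x2 are hypothetical, and tune_grid() does the equivalent work internally for every fold, so none of this is needed in practice.

```r
library(tidymodels)
library(themis)

set.seed(2020)

# Hypothetical imbalanced data (assumption for illustration only).
n <- 400
dat <- tibble(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(Class = factor(ifelse(x1 + rnorm(n) > 1.5, "minority", "majority")))

rec <- recipe(Class ~ ., data = dat) %>%
  step_smote(Class)

folds <- vfold_cv(dat, v = 5, strata = Class)
split <- folds$splits[[1]]

# Estimate the recipe on the analysis set of the first fold only.
prepped <- prep(rec, training = analysis(split))

# new_data = NULL returns the processed analysis set, with SMOTE applied.
analysis_baked <- bake(prepped, new_data = NULL)

# Baking the assessment set skips step_smote(), so no synthetic rows are added.
assessment_baked <- bake(prepped, new_data = assessment(split))

table(analysis(split)$Class)     # imbalanced before SMOTE
table(analysis_baked$Class)      # balanced after SMOTE
nrow(assessment(split)) == nrow(assessment_baked)
```

The assessment rows stay untouched because step_smote() defaults to skip = TRUE, which means the step runs when the recipe is prepped on the analysis data but is skipped when bake() is called on new data.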

benphua · August 18, 2020, 3:58pm

Wonderful, thank you Max, and thank you and your team for working on these amazing packages. Have a great and safe day!
