Hi There Community,
I have a binary classification problem with an imbalanced dataset. I would like to find a high-performing model using 5-fold cross-validation plus a grid search via tune_grid() from the tune package in tidymodels, passing a single recipe object as the preprocessor, the resamples generated by vfold_cv(), and so on.
To help with the imbalance, I would like to apply the Synthetic Minority Over-sampling Technique (SMOTE) via the themis package by including step_smote(Class) in my preprocessing recipe.
However, I am a bit concerned about how tune_grid() manages the execution of SMOTE to prevent data leakage when the resamples argument is a tibble of folds from vfold_cv().
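For concreteness, here is a minimal sketch of the kind of setup I mean. The simulated data frame `df`, its predictors `x1` and `x2`, the decision-tree model with the rpart engine, and the grid size of 10 are all placeholders chosen for illustration, not my actual data or model.

```r
library(tidymodels)
library(themis)

set.seed(123)

# Hypothetical imbalanced data: roughly 10% minority class (placeholder only)
df <- tibble(
  x1 = rnorm(500),
  x2 = rnorm(500),
  Class = factor(sample(c("majority", "minority"), 500,
                        replace = TRUE, prob = c(0.9, 0.1)))
)

# 5-fold cross-validation, stratified on the outcome
folds <- vfold_cv(df, v = 5, strata = Class)

# Preprocessing recipe with SMOTE on the outcome column
rec <- recipe(Class ~ ., data = df) %>%
  step_smote(Class)

# Placeholder model with one tunable parameter
mod <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Bundle the recipe and model, then tune over the folds
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

res <- tune_grid(
  wf,
  resamples = folds,
  grid = 10,
  metrics = metric_set(roc_auc)
)

collect_metrics(res)
```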
My understanding is that in k-fold cross-validation we should ensure that data leakage does not occur even at the fold level, meaning that nothing derived from a fold's held-out data should influence the preprocessing or the model fit for that fold.
I also understand that tidymodels is set up to prevent data leakage (I'm coming from the scikit-learn ecosystem, so I'm not familiar with caret and came straight to tidymodels).
Thus, would it be safe to assume that any dependency-causing data transformation such as SMOTE is automatically applied independently to the training data, and again independently to the test data, within each fold, so that no data leaks within a fold?
P.S. If it helps, this is similar to the discussion in section 11.2, "Subsampling During Resampling", of the caret documentation.
Thank you for your time and kind advice,
Ben