I am a student in a data analytics program with a general question regarding the "typical" sequence of steps in a tidymodels recipe. ("Typical" is in quotes, of course, to acknowledge that every data set and analytic problem may require a unique approach.) The tidymodels site provides some good guidance in its "Ordering of steps" article in the recipes documentation, which recommends:
1. Impute
2. Handle factor levels
3. Individual transformations for skewness and other issues
4. Discretize (if needed and if you have no other choice)
5. Create dummy variables
6. Create interactions
7. Normalization steps (center, scale, range, etc)
8. Multivariate transformation (e.g. PCA, spatial sign, etc)
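To make sure I'm reading that list correctly, here is a minimal sketch of a recipe that follows the published ordering. The outcome, `train_data`, and the column prefixes "income"/"education" are all hypothetical placeholders, and each step is just one possible choice for its slot:

```r
library(recipes)

rec <- recipe(outcome ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>   # 1. impute
  step_novel(all_nominal_predictors()) |>           # 2. handle factor levels
  step_YeoJohnson(all_numeric_predictors()) |>      # 3. individual transformations
  # 4. discretize -- skipped unless unavoidable
  step_dummy(all_nominal_predictors()) |>           # 5. create dummy variables
  step_interact(~ starts_with("income"):starts_with("education")) |>  # 6. interactions
  step_normalize(all_numeric_predictors()) |>       # 7. center/scale
  step_pca(all_numeric_predictors(), num_comp = 5)  # 8. multivariate transformation
```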
I'm assuming that a "step 0" might precede these and include things such as removing NA values (step_naomit), assigning special roles (update_role), or other custom selection (step_select) or exclusion (step_rm) steps, as sketched below.
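For example, the kind of "step 0" housekeeping I have in mind (`id` and `zip_code` are hypothetical column names):

```r
rec0 <- recipe(outcome ~ ., data = train_data) |>
  update_role(id, new_role = "id variable") |>  # keep for bookkeeping, exclude from modeling
  step_rm(zip_code) |>                          # drop a column outright
  step_naomit(all_predictors(), skip = TRUE)    # drop incomplete rows (training data only)
```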
What about other filtering steps such as:
- Removing highly-correlated variables with step_corr()
- Removing zero-variance (step_zv) and near-zero-variance (step_nzv) predictors
- Removing features algorithmically, such as with the steps provided by the colino package (https://github.com/stevenpawley/colino)
Where should these generally fall in the sequence? My initial thinking was that they would go at the beginning (to avoid needlessly transforming variables that are about to be removed), but then I read somewhere that step_zv should follow the handling of factor levels (step 2) in case that handling results in zero-variance predictors.
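To make the question concrete, here is the placement I would tentatively try based on that advice: a cheap near-zero-variance filter up front, with the zero-variance filter repeated after dummy creation in case rare factor levels yield all-zero dummy columns. Whether this is the right arrangement is exactly what I'm unsure about:

```r
rec_filter <- recipe(outcome ~ ., data = train_data) |>
  step_nzv(all_predictors()) |>                          # early: drop obviously useless columns cheaply
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>                           # after dummying: catch new zero-variance columns
  step_corr(all_numeric_predictors(), threshold = 0.9)   # once all predictors are numeric
```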
Also, what about upsampling/downsampling the training data set with step_upsample(), step_downsample(), etc.?
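My current understanding is that these live in the themis package and default to skip = TRUE, so they resample only the training data at prep() time and leave new data untouched, which suggests they can go late in the recipe. A sketch of where I would currently put one (`outcome` again being a hypothetical class variable):

```r
library(themis)  # step_upsample()/step_downsample() live here, not in recipes

rec_bal <- recipe(outcome ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_downsample(outcome, under_ratio = 1)  # balance classes in the training set only
```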
I'm hoping to gain some additional insight that will help me better sequence my recipes, so any suggestions are welcome. Thank you in advance for any guidance you can provide!