tidymodels Recipe Sequence

I am a student in a data analytics program with a general question about the "typical" sequence of steps in a tidymodels recipe. ("Typical" is in quotes, of course, to acknowledge that every data set and analytic problem may require a unique approach.) The tidymodels site provides some good guidance on ordering recipe steps (the "Ordering of steps" article on the recipes site), which is as follows (I've added a quick code sketch of this ordering right after the list):

1. Impute
2. Handle factor levels
3. Individual transformations for skewness and other issues
4. Discretize (if needed and if you have no other choice)
5. Create dummy variables
6. Create interactions
7. Normalization steps (center, scale, range, etc)
8. Multivariate transformation (e.g. PCA, spatial sign, etc)
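
To make that ordering concrete, here is a minimal sketch of a recipe that follows it. The data set and column names (train_df, class, the x1_/x2_ columns in the interaction) are placeholders, and the exact selectors and thresholds would of course depend on the problem:

```r
library(tidymodels)

# Hypothetical data: train_df with a factor outcome `class` plus numeric/nominal predictors
rec <- recipe(class ~ ., data = train_df) |>
  step_impute_median(all_numeric_predictors()) |>            # 1. impute
  step_other(all_nominal_predictors(), threshold = 0.05) |>  # 2. handle (collapse rare) factor levels
  step_YeoJohnson(all_numeric_predictors()) |>               # 3. individual transformations for skewness
  step_dummy(all_nominal_predictors()) |>                    # 5. create dummy variables
  step_interact(~ starts_with("x1_"):starts_with("x2_")) |>  # 6. interactions (placeholder columns)
  step_normalize(all_numeric_predictors()) |>                # 7. normalization
  step_pca(all_numeric_predictors(), num_comp = 5)           # 8. multivariate transformation
```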

I'm assuming that a "step 0" might precede these and include things such as removing NA values (step_naomit()), assigning special roles (update_role()), or possibly other custom selection (step_select()) or exclusion (step_rm()) steps.
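
For example, that housekeeping might look something like this (id and notes are made-up column names):

```r
library(tidymodels)

# Placeholder columns: `id` is an identifier, `notes` is free text we don't want as a predictor
rec0 <- recipe(class ~ ., data = train_df) |>
  update_role(id, new_role = "id variable") |>   # keep the column, but not as a predictor
  step_rm(notes) |>                              # drop a column outright
  step_naomit(all_predictors())                  # drop incomplete rows (skip = TRUE by default)
```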

What about other filtering steps such as step_zv(), step_nzv(), and step_corr()?

Where should these generally fall in the sequence? My initial thinking was that they would go at the beginning (to avoid unnecessarily transforming variables that are about to be removed), but then I read somewhere that step_zv() should follow the handling of factor levels (step 2) in case that handling produces zero-variance predictors.
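
That makes sense to me: collapsing rare levels and then creating dummy variables can leave all-zero columns, so the zero-variance filter would sit after those steps. A tiny illustration with placeholder names:

```r
library(tidymodels)

# Collapsing rare factor levels and creating dummy variables can produce
# all-zero columns, so the zero-variance filter comes afterwards to catch them.
rec_zv <- recipe(class ~ ., data = train_df) |>
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())
```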

Also, what about upsampling/downsampling the training data set with step_upsample(), step_downsample(), etc.?
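
For context, I'm referring to the steps from the themis package; as far as I understand, they default to skip = TRUE, so the resampling only happens on the training data and not when the recipe is applied to new data. A minimal sketch with placeholder names of how they attach to a recipe:

```r
library(tidymodels)
library(themis)   # provides step_downsample(), step_upsample(), step_smote(), ...

# Placeholder data: train_df with a factor outcome `class`
rec_bal <- recipe(class ~ ., data = train_df) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_downsample(class)   # balances the outcome; skipped when baking new data
```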

I'm hoping to gain some additional insight that will help me better sequence my recipes, so any suggestions are welcome. Thank you in advance for any guidance you can provide!

Based on additional research, here's what I came up with as a generic starting template (noting that not every problem would use every step, nor necessarily in this exact order); a rough code sketch follows the list:

  1. Define custom roles for variables: update_role()
  2. Preliminary feature selection/removal
    a. manually select/remove: step_select() and/or step_rm()
    b. remove zero variance/near-zero variance predictors: step_zv() and/or step_nzv()
    c. remove highly correlated features: step_corr()
  3. Observation removal/filtering and imputation
    a. remove rows with missing values: step_naomit()
    b. remove observations with extreme outliers: ???
    c. Impute missing values (various methods): step_impute_*()
  4. Quantitative variable transformations
    a. Transform for skewness or other issues: step_log(), step_sqrt(), step_BoxCox(), step_YeoJohnson(), etc.
    b. Discretize continuous variables (if needed and if you have no other choice): step_discretize(), step_cut(), etc.
  5. Categorical variable transformations
    a. Handle factor levels: step_other()
    b. Create dummy variables: step_dummy()
    c. Remove zero variance/near-zero variance predictors AGAIN (after creating dummy variables): step_zv() and/or step_nzv()
  6. Creation of interaction and polynomial terms: step_interact(), step_poly(), etc.
  7. Scale/normalize numeric data (which now includes dummy variables): step_normalize(), step_center(), step_scale(), step_range(), etc.
  8. Algorithmic feature selection (performed manually or with functions from the colino package, for example): step_select_roc(), step_select_vip(), etc.
  9. Multivariate transformation: step_pca(), step_pls(), etc.
  10. Upsample/downsample data to address imbalance: step_upsample(), step_downsample(), step_smote(), etc.
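
To pull the template together, here is a rough sketch of what such a recipe might look like. Everything here is illustrative: train_df, class, and id are placeholder names, the thresholds are arbitrary, and I've left out items 6 and 8 (interactions and algorithmic feature selection via colino) since those depend heavily on the data:

```r
library(tidymodels)
library(themis)   # for step_downsample()

rec_template <- recipe(class ~ ., data = train_df) |>
  # 1. custom roles
  update_role(id, new_role = "id variable") |>
  # 2. preliminary feature removal
  step_zv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  # 3. imputation
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  # 4. quantitative transformations
  step_YeoJohnson(all_numeric_predictors()) |>
  # 5. categorical transformations, then re-check for zero variance
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  # 7. normalization (interaction terms from item 6 would come just before this)
  step_normalize(all_numeric_predictors()) |>
  # 9. multivariate transformation
  step_pca(all_numeric_predictors(), num_comp = 10) |>
  # 10. address class imbalance (training data only; skip = TRUE by default)
  step_downsample(class)
```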

Critiques and suggestions are absolutely welcome!
