Based on additional research, here's what I came up with as a generic starting template (noting that not every problem would use every step, and not necessarily in this order). A rough end-to-end sketch of what this could look like in code follows the list.
- Define custom roles for variables: `update_role()`
- Preliminary feature selection/removal
  a. Manually select/remove: `step_select()` and/or `step_rm()`
  b. Remove zero-variance/near-zero-variance predictors: `step_zv()` and/or `step_nzv()`
  c. Remove highly correlated features: `step_corr()`
- Observation removal/filtering and imputation
  a. Remove rows with missing values: `step_naomit()`
  b. Remove observations with extreme outliers: ???
  c. Impute missing values (various methods): `step_impute_*()`
- Quantitative variable transformations
  a. Transform for skewness or other issues: `step_log()`, `step_sqrt()`, `step_BoxCox()`, `step_YeoJohnson()`, etc.
  b. Discretize continuous variables (if needed and if you have no other choice): `step_discretize()` or `step_cut()`
- Categorical variable transformations
  a. Handle factor levels: `step_other()`
  b. Create dummy variables: `step_dummy()`
  c. Remove zero-variance/near-zero-variance predictors AGAIN (after creating dummy variables): `step_zv()` and/or `step_nzv()`
- Creation of interaction terms: `step_interact()`, `step_poly()`, etc.
- Scale/normalize numeric data (which now includes dummy variables): `step_normalize()`, `step_center()`, `step_scale()`, `step_range()`, etc.
- Algorithmic feature selection (performed manually or with functions from the colino package, for example): `step_select_roc()`, `step_select_vip()`, etc.
- Multivariate transformation: `step_pca()`, `step_pls()`, etc.
- Upsample/downsample data to address class imbalance (e.g., with the themis package): `step_upsample()`, `step_downsample()`, `step_smote()`, etc.
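
To make the template concrete, here's a minimal sketch of what one recipe following this ordering might look like. I'm using `palmerpenguins::penguins` purely as a stand-in: the outcome (`sex`), the specific columns, and every threshold/tuning value below are placeholders I picked for illustration, and a real problem would rarely use all of these steps at once. The colino steps come from the GitHub package (stevenpawley/colino), and I'm showing its arguments as I understand its documented interface.

```r
library(recipes)
library(themis)          # step_upsample(), step_downsample(), step_smote(), ...
library(colino)          # GitHub: stevenpawley/colino -- step_select_roc(), step_select_vip()
library(palmerpenguins)

# Keep the sketch simple: drop rows where the outcome itself is missing
penguins_df <- tidyr::drop_na(penguins, sex)

template_rec <- recipe(sex ~ ., data = penguins_df) %>%
  # 1. Custom roles: keep `year` around as an ID-like column, not a predictor
  update_role(year, new_role = "id") %>%
  # 2. Preliminary feature selection/removal
  step_rm(island) %>%
  step_zv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  # 3. Missing data: in practice you'd pick imputation OR row removal, not both
  step_impute_median(all_numeric_predictors()) %>%
  # 4. Quantitative transformations (skewness etc.)
  step_YeoJohnson(all_numeric_predictors()) %>%
  # 5. Categorical transformations: pool rare levels, dummy-code,
  #    then re-check for zero-variance columns
  step_other(all_nominal_predictors(), threshold = 0.05) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  # 6. Interactions (these column names are just an example)
  step_interact(terms = ~ bill_length_mm:bill_depth_mm) %>%
  # 7. Scale/normalize, which now includes the dummy and interaction columns
  step_normalize(all_numeric_predictors()) %>%
  # 8. Algorithmic feature selection via colino (check the package docs
  #    for the exact arguments)
  step_select_roc(all_numeric_predictors(), outcome = "sex", top_p = 5) %>%
  # 9. Multivariate transformation
  step_pca(all_numeric_predictors(), num_comp = 3) %>%
  # 10. Address class imbalance via themis
  step_downsample(sex)

prepped <- prep(template_rec)
bake(prepped, new_data = NULL)   # processed training set
```

One note on the sampling steps at the end: the themis steps default to `skip = TRUE`, so the up/down-sampling is only applied when the recipe is prepped on the training data and is skipped when baking new data, which is what you want for a test set.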
Critiques and suggestions are absolutely welcome!