Based on additional research, here's what I came up with as a generic starting template (noting that not every problem would use every step, and not necessarily in this order):
- Define custom roles for variables: `update_role()`
- Preliminary feature selection/removal
a. Manually select/remove predictors: `step_select()` and/or `step_rm()`
b. Remove zero-variance/near-zero-variance predictors: `step_zv()` and/or `step_nzv()`
c. Remove highly correlated features: `step_corr()`
- Observation removal/filtering and imputation
a. Remove rows with missing values: `step_naomit()`
b. Remove observations with extreme outliers: ???
c. Impute missing values (various methods): `step_impute_*()`
- Quantitative variable transformations
a. Transform for skewness or other issues: `step_log()`, `step_sqrt()`, `step_BoxCox()`, `step_YeoJohnson()`, etc.
b. Discretize continuous variables (only if needed, and only if you have no other choice): `step_discretize()` or `step_cut()`
- Categorical variable transformations
a. Handle factor levels: `step_other()`
b. Create dummy variables: `step_dummy()`
c. Remove zero-variance/near-zero-variance predictors again (after creating dummy variables): `step_zv()` and/or `step_nzv()`
- Creation of interaction and/or polynomial terms:
`step_interact()`, `step_poly()`, etc.
- Scale/normalize numeric data (which now includes the dummy variables):
`step_normalize()`, `step_center()`, `step_scale()`, `step_range()`, etc.
- Algorithmic feature selection (performed manually, or with functions from the colino package, for example):
`step_select_roc()`, `step_select_vip()`, etc.
- Multivariate transformations:
`step_pca()`, `step_pls()`, etc.
- Upsample/downsample the data to address class imbalance:
`step_upsample()`, `step_downsample()`, `step_smote()`, etc. (from the themis package)
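To make the ordering concrete, here is a minimal sketch of a recipe that applies a few representative steps from the stages above, in the template's order. The data set (`train_df`, with a factor outcome `class` and an identifier column `id`) and the particular steps chosen are assumptions for illustration, not a prescription:

```r
library(recipes)
library(themis)  # step_upsample() and friends live here

# Hypothetical training data: outcome `class`, id column `id`,
# plus a mix of numeric and nominal predictors.
rec <- recipe(class ~ ., data = train_df) |>
  # Roles: keep `id` in the data without using it as a predictor
  update_role(id, new_role = "id variable") |>
  # Preliminary filtering
  step_zv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  # Imputation
  step_impute_median(all_numeric_predictors()) |>
  # Skewness transformation
  step_YeoJohnson(all_numeric_predictors()) |>
  # Categorical handling: pool rare levels, then dummy-code
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  step_dummy(all_nominal_predictors()) |>
  # Zero-variance filter AGAIN, after dummy creation
  step_zv(all_predictors()) |>
  # Scaling (now includes the dummy variables)
  step_normalize(all_numeric_predictors()) |>
  # Class imbalance (applied to training data only)
  step_upsample(class)

prepped <- prep(rec)                       # estimate from training data
baked   <- bake(prepped, new_data = NULL)  # preprocessed training set
```

Note that `prep()` estimates everything (medians, Yeo-Johnson lambdas, scaling statistics) from the training data, so the same fitted recipe can then be baked on new data without leakage.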
Critiques and suggestions are absolutely welcome!