Based on additional research, here's what I came up with as a generic starting template (noting that not every problem would use every step, and not necessarily in this order). A rough end-to-end sketch of what this could look like in code follows the list.
- Define custom roles for variables: `update_role()`
- Preliminary feature selection/removal
  a. Manually select/remove: `step_select()` and/or `step_rm()`
  b. Remove zero-variance/near-zero-variance predictors: `step_zv()` and/or `step_nzv()`
  c. Remove highly correlated features: `step_corr()`
- Observation removal/filtering and imputation
  a. Remove rows with missing values: `step_naomit()`
  b. Remove observations with extreme outliers: ???
  c. Impute missing values (various methods): `step_impute_*()`
- Quantitative variable transformations
  a. Transform for skewness or other issues: `step_log()`, `step_sqrt()`, `step_BoxCox()`, `step_YeoJohnson()`, etc.
  b. Discretize continuous variables (if needed and if you have no other choice): `step_discretize()` or `step_cut()`
- Categorical variable transformations
  a. Handle factor levels: `step_other()`
  b. Create dummy variables: `step_dummy()`
  c. Remove zero-variance/near-zero-variance predictors AGAIN (after creating dummy variables): `step_zv()` and/or `step_nzv()`
- Creation of interaction terms: `step_interact()`, `step_poly()`, etc.
- Scale/normalize numeric data (which now includes dummy variables): `step_normalize()`, `step_center()`, `step_scale()`, `step_range()`, etc.
- Algorithmic feature selection (performed manually or with functions from the colino package, for example): `step_select_roc()`, `step_select_vip()`, etc.
- Multivariate transformation: `step_pca()`, `step_pls()`, etc.
- Upsample/downsample data to address class imbalance (e.g., with the themis package): `step_upsample()`, `step_downsample()`, `step_smote()`, etc.
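
To make the template concrete, here's a minimal sketch of what one recipe following this ordering might look like. I'm using `palmerpenguins::penguins` purely as a stand-in: the outcome (`sex`), the specific columns, and every threshold/tuning value below are placeholders I picked for illustration, and a real problem would rarely use all of these steps at once. The colino steps come from the GitHub package (stevenpawley/colino), and I'm showing its arguments as I understand its documented interface.

```r
library(recipes)
library(themis)          # step_upsample(), step_downsample(), step_smote(), ...
library(colino)          # GitHub: stevenpawley/colino -- step_select_roc(), step_select_vip()
library(palmerpenguins)

# Keep the sketch simple: drop rows where the outcome itself is missing
penguins_df <- tidyr::drop_na(penguins, sex)

template_rec <- recipe(sex ~ ., data = penguins_df) %>%
  # 1. Custom roles: keep `year` around as an ID-like column, not a predictor
  update_role(year, new_role = "id") %>%
  # 2. Preliminary feature selection/removal
  step_rm(island) %>%
  step_zv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  # 3. Missing data: in practice you'd pick imputation OR row removal, not both
  step_impute_median(all_numeric_predictors()) %>%
  # 4. Quantitative transformations (skewness etc.)
  step_YeoJohnson(all_numeric_predictors()) %>%
  # 5. Categorical transformations: pool rare levels, dummy-code,
  #    then re-check for zero-variance columns
  step_other(all_nominal_predictors(), threshold = 0.05) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  # 6. Interactions (these column names are just an example)
  step_interact(terms = ~ bill_length_mm:bill_depth_mm) %>%
  # 7. Scale/normalize, which now includes the dummy and interaction columns
  step_normalize(all_numeric_predictors()) %>%
  # 8. Algorithmic feature selection via colino (check the package docs
  #    for the exact arguments)
  step_select_roc(all_numeric_predictors(), outcome = "sex", top_p = 5) %>%
  # 9. Multivariate transformation
  step_pca(all_numeric_predictors(), num_comp = 3) %>%
  # 10. Address class imbalance via themis
  step_downsample(sex)

prepped <- prep(template_rec)
bake(prepped, new_data = NULL)   # processed training set
```

One note on the sampling steps at the end: the themis steps default to `skip = TRUE`, so the up/down-sampling is only applied when the recipe is prepped on the training data and is skipped when baking new data, which is what you want for a test set.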
Critiques and suggestions are absolutely welcome!