Hi everyone!
I think I might be missing something, so my apologies if this question is basic.
I've got a situation where I want to create a formula that contains 350 or so variables, in a dataset that contains 370 variables.
How do I construct the formula programmatically within recipes? Almost all the examples I see are of the form:
recipe(y ~ ., data = data)
However, I don't want to add all the variables to the formula.
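One base-R route that may fit here is reformulate(), which builds a formula from character vectors. A minimal sketch, assuming hypothetical column names x1 to x5 in place of the real predictors:

```r
# Hypothetical names standing in for the ~350 real predictor columns
my_outcome <- "y"
my_predictors <- paste0("x", 1:5)

# reformulate() assembles `y ~ x1 + x2 + ...` from character vectors
my_formula <- reformulate(termlabels = my_predictors, response = my_outcome)
my_formula
```

The resulting formula could then be passed along as recipe(my_formula, data = data), assuming every name in it is a column of data.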
I notice in workflows that there is a function, add_variables(), which pretty much does what I want, in that I get to specify:
add_variables(
  workflow,
  outcomes = my_outcome,
  predictors = list_of_many_predictors
)
And I guess that is fine, but overall I want to fit two different tree-based models to the same variables: a random forest and a boosted tree. Currently the setup looks like this:
library(tidymodels)
tidymodels_prefer()
model_spec_xgb <- boost_tree(
  tree_depth = 5,
  trees = 100,
  learn_rate = 0.001,
  mtry = 0.7
) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
model_spec_xgb
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 0.7
#> trees = 100
#> tree_depth = 5
#> learn_rate = 0.001
#>
#> Computational engine: xgboost
my_outcomes <- "y"
my_predictors <- rep(LETTERS, 10)
workflow_xgb <- workflow() %>%
  add_model(spec = model_spec_xgb) %>%
  add_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  )
workflow_xgb
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: my_outcomes
#> Predictors: my_predictors
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Boosted Tree Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 0.7
#> trees = 100
#> tree_depth = 5
#> learn_rate = 0.001
#>
#> Computational engine: xgboost
model_spec_rf <- rand_forest(
  mtry = 0.7,
  trees = 1000,
  min_n = 10
) %>%
  set_mode("regression") %>%
  set_engine("randomForest")
model_spec_rf
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 0.7
#> trees = 1000
#> min_n = 10
#>
#> Computational engine: randomForest
workflow_rf <- workflow() %>%
  add_model(spec = model_spec_rf) %>%
  add_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  )
workflow_rf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: my_outcomes
#> Predictors: my_predictors
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 0.7
#> trees = 1000
#> min_n = 10
#>
#> Computational engine: randomForest
Created on 2023-12-21 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.2 (2023-10-31)
#> os macOS Sonoma 14.0
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Australia/Brisbane
#> date 2023-12-21
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
#> broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.0)
#> cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
#> class 7.3-22 2023-05-03 [1] CRAN (R 4.3.2)
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
#> codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.2)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
#> conflicted 1.2.0 2023-02-01 [1] CRAN (R 4.3.0)
#> data.table 1.14.8 2023-02-17 [1] CRAN (R 4.3.0)
#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)
#> DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.3.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
#> dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
#> fansi 1.0.5 2023-10-08 [1] CRAN (R 4.3.1)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)
#> furrr 0.3.1 2022-08-15 [1] CRAN (R 4.3.0)
#> future 1.33.0 2023-07-01 [1] CRAN (R 4.3.0)
#> future.apply 1.11.0 2023-05-21 [1] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
#> ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
#> globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.0)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.3.0)
#> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
#> hardhat 1.3.0 2023-03-30 [1] CRAN (R 4.3.0)
#> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.1)
#> infer * 1.0.5 2023-09-06 [1] CRAN (R 4.3.0)
#> ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.0)
#> iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
#> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
#> lattice 0.21-9 2023-10-01 [1] CRAN (R 4.3.2)
#> lava 1.7.2.1 2023-02-27 [1] CRAN (R 4.3.0)
#> lhs 1.1.6 2022-12-17 [1] CRAN (R 4.3.0)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
#> listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.0)
#> lubridate 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.2)
#> Matrix 1.6-1.1 2023-09-18 [1] CRAN (R 4.3.2)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
#> modeldata * 1.2.0 2023-08-09 [1] CRAN (R 4.3.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
#> nnet 7.3-19 2023-05-03 [1] CRAN (R 4.3.2)
#> parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
#> parsnip * 1.1.1 2023-08-17 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
#> Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.0)
#> recipes * 1.0.8 2023-08-25 [1] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.1)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
#> rpart 4.1.21 2023-10-09 [1] CRAN (R 4.3.2)
#> rsample * 1.2.0 2023-08-23 [1] CRAN (R 4.3.0)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
#> scales * 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> styler 1.9.1 2023-03-04 [1] CRAN (R 4.3.0)
#> survival 3.5-7 2023-08-14 [1] CRAN (R 4.3.2)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> tidymodels * 1.1.1 2023-08-24 [1] CRAN (R 4.3.0)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
#> timeDate 4022.108 2023-01-07 [1] CRAN (R 4.3.0)
#> tune * 1.1.2 2023-08-23 [1] CRAN (R 4.3.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
#> withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1)
#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)
#> workflowsets * 1.0.1 2023-04-06 [1] CRAN (R 4.3.0)
#> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.1)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1)
#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
But when I try and combine these into a workflowset, like so, I get an error:
workflow_set(
  models = list(
    xgb = model_spec_xgb,
    rf = model_spec_rf
  )
)
#> Error in workflow_set(models = list(xgb = model_spec_xgb, rf = model_spec_rf)): argument "preproc" is missing, with no default
And then when I try to set the preproc argument like so, I get another error:
workflow_set(
  preproc = workflow_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  ),
  models = list(
    xgb = model_spec_xgb,
    rf = model_spec_rf
  )
)
#> Error in `tidyr::crossing()`:
#> ! `..1` must be a vector, not a <workflow_variables> object.
#> Backtrace:
#>      ▆
#>   1. ├─workflowsets::workflow_set(...)
#>   2. │ └─workflowsets:::cross_objects(preproc, models)
#>   3. │   ├─... %>% dplyr::select(wflow_id, preproc, model = models)
#>   4. │   └─tidyr::crossing(preproc, models)
#>   5. │     └─tidyr:::grid_dots(...)
#>   6. │       └─vctrs::vec_assert(dot, arg = arg, call = .error_call)
#>   7. │         └─vctrs:::stop_scalar_type(x, arg, call = call)
#>   8. │           └─vctrs:::stop_vctrs(...)
#>   9. │             └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
#>  10. ├─dplyr::select(., wflow_id, preproc, model = models)
#>  11. ├─dplyr::mutate(., wflow_id = paste(pp_nm, mod_nm, sep = "_"))
#>  12. └─dplyr::mutate(., pp_nm = names(preproc), mod_nm = names(models))
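The error says that preproc must be a vector, and the workflow_set() documentation describes preproc as a list of preprocessors (formulas, recipes, or workflow_variables() objects). A possible direction, then, is to wrap the variables object in a named list. This is only a sketch, not run against the real data: the list name vars and the placeholder column names are assumptions, and mtry is left out to keep the specs minimal.

```r
library(tidymodels)

# Placeholder names (assumed); swap in the real outcome and predictor columns
my_outcomes <- "y"
my_predictors <- paste0("x", 1:5)

model_spec_xgb <- boost_tree(trees = 100) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

model_spec_rf <- rand_forest(trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("randomForest")

# preproc is given as a *named list* holding one workflow_variables() object;
# all_of() marks the character vectors as column names for tidyselect
wf_set <- workflow_set(
  preproc = list(
    vars = workflow_variables(
      outcomes = all_of(my_outcomes),
      predictors = all_of(my_predictors)
    )
  ),
  models = list(
    xgb = model_spec_xgb,
    rf = model_spec_rf
  )
)

wf_set
```

By default workflow_set() crosses every preprocessor with every model, so one preprocessor and two models should give a set with two workflows.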
I'd like to take advantage of workflowsets, but that requires some preprocessing step, which I can't seem to incorporate given what my workflow currently looks like. I am fairly certain I'm just missing some basic step with recipes, but I'm kind of new to the tidymodels world, so any help would be much appreciated!