How to make large formula with recipes

njtierney · December 21, 2023, 12:26am

Hi everyone!

I think I might be missing something, so my apologies if this question is basic.

I've got a situation where I want to create a formula that contains 350 or so variables, in a dataset that contains 370 variables.

How do I construct the formula programmatically within recipes? Almost all the examples I see are of the form:

recipe(y ~ ., data = data)

However I don't want to add all the variables to the formula.

I notice in workflows that there is a function add_variables, which pretty much does what I want, in that I get to specify:

add_variables(
  workflow,
  outcomes = my_outcome,
  predictors = list_of_many_predictors
)

And I guess that is fine, but overall I want to fit two different models to the same set of variables: a random forest and a boosted tree. Currently the setup looks like this:

library(tidymodels)
tidymodels_prefer()

model_spec_xgb <- boost_tree(
  tree_depth = 5,
  trees = 100,
  learn_rate = 0.001,
  mtry = 0.7
) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

model_spec_xgb
#> Boosted Tree Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 0.7
#>   trees = 100
#>   tree_depth = 5
#>   learn_rate = 0.001
#> 
#> Computational engine: xgboost

my_outcomes <- "y"
my_predictors <- rep(LETTERS, 10)

workflow_xgb <- workflow() %>%
  add_model(spec = model_spec_xgb) %>%
  add_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  )

workflow_xgb
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: my_outcomes
#> Predictors: my_predictors
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Boosted Tree Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 0.7
#>   trees = 100
#>   tree_depth = 5
#>   learn_rate = 0.001
#> 
#> Computational engine: xgboost

model_spec_rf <- rand_forest(
  mtry = 0.7,
  trees = 1000,
  min_n = 10
) %>%
  set_mode("regression") %>%
  set_engine("randomForest")

model_spec_rf
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 0.7
#>   trees = 1000
#>   min_n = 10
#> 
#> Computational engine: randomForest

workflow_rf <- workflow() %>%
  add_model(spec = model_spec_rf) %>%
  add_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  )

workflow_rf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: my_outcomes
#> Predictors: my_predictors
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 0.7
#>   trees = 1000
#>   min_n = 10
#> 
#> Computational engine: randomForest

Created on 2023-12-21 with reprex v2.0.2

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.0
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Australia/Brisbane
#>  date     2023-12-21
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
#>  broom        * 1.0.5      2023-06-09 [1] CRAN (R 4.3.0)
#>  cachem         1.0.8      2023-05-01 [1] CRAN (R 4.3.0)
#>  class          7.3-22     2023-05-03 [1] CRAN (R 4.3.2)
#>  cli            3.6.2      2023-12-11 [1] CRAN (R 4.3.1)
#>  codetools      0.2-19     2023-02-01 [1] CRAN (R 4.3.2)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
#>  conflicted     1.2.0      2023-02-01 [1] CRAN (R 4.3.0)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.3.0)
#>  dials        * 1.2.0      2023-04-03 [1] CRAN (R 4.3.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.3.0)
#>  digest         0.6.33     2023-07-07 [1] CRAN (R 4.3.0)
#>  dplyr        * 1.1.3      2023-09-03 [1] CRAN (R 4.3.0)
#>  evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi          1.0.5      2023-10-08 [1] CRAN (R 4.3.1)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.3.0)
#>  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.0)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.3.0)
#>  future         1.33.0     2023-07-01 [1] CRAN (R 4.3.0)
#>  future.apply   1.11.0     2023-05-21 [1] CRAN (R 4.3.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2      * 3.4.4      2023-10-12 [1] CRAN (R 4.3.1)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.3.0)
#>  gtable         0.3.4      2023-08-21 [1] CRAN (R 4.3.0)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.3.0)
#>  htmltools      0.5.7      2023-11-03 [1] CRAN (R 4.3.1)
#>  infer        * 1.0.5      2023-09-06 [1] CRAN (R 4.3.0)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.3.0)
#>  knitr          1.45       2023-10-30 [1] CRAN (R 4.3.1)
#>  lattice        0.21-9     2023-10-01 [1] CRAN (R 4.3.2)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.3.0)
#>  lhs            1.1.6      2022-12-17 [1] CRAN (R 4.3.0)
#>  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.3.0)
#>  lubridate      1.9.2      2023-02-10 [1] CRAN (R 4.3.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  MASS           7.3-60     2023-05-04 [1] CRAN (R 4.3.2)
#>  Matrix         1.6-1.1    2023-09-18 [1] CRAN (R 4.3.2)
#>  memoise        2.0.1      2021-11-26 [1] CRAN (R 4.3.0)
#>  modeldata    * 1.2.0      2023-08-09 [1] CRAN (R 4.3.0)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.3.0)
#>  nnet           7.3-19     2023-05-03 [1] CRAN (R 4.3.2)
#>  parallelly     1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
#>  parsnip      * 1.1.1      2023-08-17 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
#>  prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
#>  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.3.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.0)
#>  recipes      * 1.0.8      2023-08-25 [1] CRAN (R 4.3.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang          1.1.2      2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.1)
#>  rpart          4.1.21     2023-10-09 [1] CRAN (R 4.3.2)
#>  rsample      * 1.2.0      2023-08-23 [1] CRAN (R 4.3.0)
#>  rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.0)
#>  scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.3.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  styler         1.9.1      2023-03-04 [1] CRAN (R 4.3.0)
#>  survival       3.5-7      2023-08-14 [1] CRAN (R 4.3.2)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
#>  tidymodels   * 1.1.1      2023-08-24 [1] CRAN (R 4.3.0)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.3.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.3.0)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.3.0)
#>  tune         * 1.1.2      2023-08-23 [1] CRAN (R 4.3.0)
#>  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
#>  withr          2.5.2      2023-10-30 [1] CRAN (R 4.3.1)
#>  workflows    * 1.1.3      2023-02-22 [1] CRAN (R 4.3.0)
#>  workflowsets * 1.0.1      2023-04-06 [1] CRAN (R 4.3.0)
#>  xfun           0.41       2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml           2.3.8      2023-12-11 [1] CRAN (R 4.3.1)
#>  yardstick    * 1.2.0      2023-04-21 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

But when I try to combine these into a workflow set, like so, I get an error:

workflow_set(
  models = list(
    xgb = model_spec_xgb,
    rf = model_spec_rf
  )
)
#> Error in workflow_set(models = list(xgb = model_spec_xgb, rf = model_spec_rf)): argument "preproc" is missing, with no default

And then when I try to set the preproc argument like so, I get another error:

workflow_set(
  preproc = workflow_variables(
    outcomes = my_outcomes,
    predictors = my_predictors
  ),
  models = list(
    xgb = model_spec_xgb,
    rf = model_spec_rf
  )
)
#> Error in `tidyr::crossing()`:
#> ! `..1` must be a vector, not a <workflow_variables> object.
#> Backtrace:
#>      ▆
#>   1. ├─workflowsets::workflow_set(...)
#>   2. │ └─workflowsets:::cross_objects(preproc, models)
#>   3. │   ├─... %>% dplyr::select(wflow_id, preproc, model = models)
#>   4. │   └─tidyr::crossing(preproc, models)
#>   5. │     └─tidyr:::grid_dots(...)
#>   6. │       └─vctrs::vec_assert(dot, arg = arg, call = .error_call)
#>   7. │         └─vctrs:::stop_scalar_type(x, arg, call = call)
#>   8. │           └─vctrs:::stop_vctrs(...)
#>   9. │             └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
#>  10. ├─dplyr::select(., wflow_id, preproc, model = models)
#>  11. ├─dplyr::mutate(., wflow_id = paste(pp_nm, mod_nm, sep = "_"))
#>  12. └─dplyr::mutate(., pp_nm = names(preproc), mod_nm = names(models))

I'd like to take advantage of workflowsets, but that requires some preprocessing step, which I can't seem to incorporate given what my workflow currently looks like. I am fairly certain I'm just missing some basic step with recipes, but I'm kind of new to the tidymodels world, so any help would be much appreciated!

Max · December 21, 2023, 1:40pm

I suggest using recipe(y ~ ., data = data) and changing the roles of the extra columns to something besides "outcome" or "predictor".

This will keep all of the columns around during the entire process but not use them in any models (because of their roles). For example, they will still be there if/when you use augment() and so on.
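A minimal sketch of the role change, assuming a small made-up data frame. The columns `id` and `notes` and the role name "extra" are hypothetical; any role other than "outcome" or "predictor" should work the same way.

```r
library(recipes)

# Hypothetical data: outcome `y`, predictors `a` and `b`, plus two columns
# (`id`, `notes`) that should be carried along but not used by the model.
set.seed(123)
dat <- data.frame(
  y = rnorm(10),
  a = rnorm(10),
  b = rnorm(10),
  id = 1:10,
  notes = letters[1:10]
)

# Start from the "everything" formula, then move the extra columns
# out of the "predictor" role.
rec <- recipe(y ~ ., data = dat) %>%
  update_role(id, notes, new_role = "extra")

# Inspect which role each column ended up with.
role_info <- summary(rec)
role_info[, c("variable", "role")]
```

Only `a` and `b` keep the "predictor" role here, so only they would be handed to a model fitted through this recipe.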

njtierney · December 21, 2023, 11:56pm

Thanks, @Max! That makes sense to me. I can use tidyselect selector functions to choose those columns, so I don't have to type out the 30-50 variables by hand, which is what I wanted to avoid!
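For anyone finding this later, here is a rough sketch of how that might look, assuming the extra columns share a prefix. The `meta_` prefix, the column names, and the role name "extra" are all made up for illustration. Note that `preproc` in workflow_set() expects a list (preferably named) of preprocessors, which appears to be why passing a bare workflow_variables() object failed above. No model is fitted in this sketch.

```r
library(tidymodels)

# Hypothetical data: outcome `y`, predictors `x1`..`x5`, and two columns
# prefixed `meta_` that should be kept but not modelled.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(50), ncol = 5))
names(dat) <- paste0("x", 1:5)
dat$y <- rnorm(10)
dat$meta_id <- 1:10
dat$meta_site <- rep(c("a", "b"), 5)

# Select the extra columns with a tidyselect helper instead of typing them out.
rec <- recipe(y ~ ., data = dat) %>%
  update_role(starts_with("meta_"), new_role = "extra")

# `preproc` and `models` are both named lists; the names are combined
# to build the workflow ids.
wf_set <- workflow_set(
  preproc = list(rec = rec),
  models = list(
    xgb = boost_tree(trees = 100) %>%
      set_mode("regression") %>%
      set_engine("xgboost"),
    rf = rand_forest(trees = 1000) %>%
      set_mode("regression") %>%
      set_engine("randomForest")
  )
)

wf_set$wflow_id
```

If the columns do not share a prefix, all_of() with a character vector of names should work in the same place as starts_with().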
