Hey Matt,
That should not be the case.
When passing the data to the model, only the predictors and outcomes should be exposed to the modeling function. If that is not what you are seeing, please file an issue as soon as you can, since we will be sending new versions of recipes (and hardhat) to CRAN very soon.
For your example:
library(tidymodels)
library(hardhat) # I have hardhat_1.1.0 from CRAN
tidymodels_prefer()
data(biomass)
# However, `sample` and `dataset` aren't predictors. Since they already have
# roles, `update_role()` can be used to change them to any arbitrary role:
rec <-
recipe(HHV ~ ., data = biomass) %>%
update_role(sample, new_role = "id variable") %>%
update_role(dataset, new_role = "splitting variable")
summary(rec)
#> # A tibble: 8 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 sample nominal id variable original
#> 2 dataset nominal splitting variable original
#> 3 carbon numeric predictor original
#> 4 hydrogen numeric predictor original
#> 5 oxygen numeric predictor original
#> 6 nitrogen numeric predictor original
#> 7 sulfur numeric predictor original
#> 8 HHV numeric outcome original
wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(linear_reg())
wflow_fit <- fit(wflow, data = biomass)
# it should only get the predictors and outcomes
wflow_fit %>% extract_fit_engine() %>% coef() %>% names()
#> [1] "(Intercept)" "carbon" "hydrogen" "oxygen" "nitrogen"
#> [6] "sulfur"
Created on 2022-06-27 by the reprex package (v2.0.1)
The point of the non-standard roles: you can keep certain columns around in your data without them being used in the model. After you fit the model, you might want these around to troubleshoot poor predictions, make plots, or anything else. That is still the goal.
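For example, because those columns are never passed to the modeling function but stay in the data, you can line predictions up with them afterwards. Here is a sketch, assuming the `wflow_fit` object from the reprex above and the `augment()` method that workflows provides for fitted workflows:

```r
library(tidymodels)

# `augment()` returns the original data with prediction columns appended,
# so the id/splitting columns are still there for troubleshooting or plots.
preds <- augment(wflow_fit, new_data = biomass)

preds %>%
  select(sample, dataset, HHV, .pred) %>%
  head()
```

The model itself never saw `sample` or `dataset`, but they remain available to group, filter, or facet the predictions.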
What changed in hardhat 1.1.0?
To recap what is currently happening: the main change to hardhat was related to our addition of case weight tools across the tidymodels packages.
With case weights, we needed a way to determine which columns must be available when bake() is used.
With the case weight change, we needed to address non-standard roles (e.g. not predictor or outcome). Our first attempt resulted in a number of breakages (which you thankfully reported).
We have a better solution in a PR that is easier for users and will break fewer existing recipes and packages.
In the upcoming versions of hardhat and recipes, a new recipes function will let you declare what is required at bake time and what is not. It puts all of the choice into the recipe object, and the workflow and hardhat objects are mostly agnostic to these choices.
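As a sketch of how that could look, assuming the `update_role_requirements()` API that was added to recipes for this purpose (continuing from the `rec` object above):

```r
# Sketch: declare that columns with the "id variable" role are NOT required
# at bake()/predict() time, so new data may omit the `sample` column.
rec2 <- rec %>%
  update_role_requirements(role = "id variable", bake = FALSE)

wflow2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(linear_reg())

wflow_fit2 <- fit(wflow2, data = biomass)

# Predicting on data that lacks the `sample` column should now work
predict(wflow_fit2, new_data = biomass %>% select(-sample))
```

The role requirement lives in the recipe object itself, which is what keeps workflows and hardhat agnostic to the choice.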
We're doing the most extensive reverse dependency checking that we can for these releases.