This is the closest "Learn" topic I could find. Generally, it is best practice to save the imputation models and the scaling/centering parameters estimated from the training set (they affect the parameters of any modeling algorithm) and to apply those same models and parameters to the test set or to future data being scored, rather than re-estimating them, as this example seems to do, for every data set that flows through the workflow and predict. I was able to do this previously in caret. Maybe I am missing something in tidymodels.
The learn topic below seems to indicate differently.
I'm not sure I understand your question. As far as I know, a trained workflow remembers the values from the training data. New data will be centered with the values from the training set.
Great news, and thanks Max. The other clarification is how they are to be saved. Is it done with the final "predict" or otherwise? And is it done strictly through YAML, or by other methods?
The workflow object contains the preprocessing object (e.g. a recipe) and that stores all of the information used to encode/format/preprocess new data.
In some cases, the model function itself might do some of this. When that is the case, the model object would contain the training set statistics.
For example:
library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())
rec <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors(), id = "norm")

model_fit <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Get the "fitted" recipe:
model_fit %>%
  extract_recipe()
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor         10
#>
#> Training data contained 32 data points and no missing data.
#>
#> Operations:
#>
#> Centering and scaling for cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb [trained]

# Get the training set means and sds
model_fit %>%
  extract_recipe() %>%
  tidy(id = "norm")
#> # A tibble: 20 × 4
#>    terms statistic   value id
#>    <chr> <chr>       <dbl> <chr>
#>  1 cyl   mean        6.19  norm
#>  2 disp  mean      231.    norm
#>  3 hp    mean      147.    norm
#>  4 drat  mean        3.60  norm
#>  5 wt    mean        3.22  norm
#>  6 qsec  mean       17.8   norm
#>  7 vs    mean        0.438 norm
#>  8 am    mean        0.406 norm
#>  9 gear  mean        3.69  norm
#> 10 carb  mean        2.81  norm
#> 11 cyl   sd          1.79  norm
#> 12 disp  sd        124.    norm
#> 13 hp    sd         68.6   norm
#> 14 drat  sd          0.535 norm
#> 15 wt    sd          0.978 norm
#> 16 qsec  sd          1.79  norm
#> 17 vs    sd          0.504 norm
#> 18 am    sd          0.499 norm
#> 19 gear  sd          0.738 norm
#> 20 carb  sd          1.62  norm
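
To check that new data really is processed with the stored training set statistics rather than its own, here is a minimal sketch. It assumes the same mtcars workflow as above; new_cars is just a hypothetical stand-in for data that arrives later. It bakes a few rows through the fitted recipe and compares one column with a by-hand calculation:

```r
library(tidymodels)

# Refit the same workflow so this snippet runs on its own
model_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ ., data = mtcars) %>%
      step_normalize(all_numeric_predictors(), id = "norm")
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Hypothetical "new" data: three rows standing in for future inputs
new_cars <- mtcars[1:3, ]

# Apply the trained recipe to the new rows
baked <-
  model_fit %>%
  extract_recipe() %>%
  bake(new_data = new_cars)

# Normalize wt by hand with the *training set* mean and sd
by_hand <- (new_cars$wt - mean(mtcars$wt)) / sd(mtcars$wt)

# Compare the recipe's result with the by-hand result
all.equal(baked$wt, by_hand)
```

If the recipe had re-estimated the mean and sd from the three new rows, the two results would not agree.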
Thanks, clear enough, but I meant for processing future data, as in scoring new inputs at some later point. This would be outside of, and independent from, the initial workflow setup. It could be months after the original model was set up.
You can save the workflow object itself. The preprocessing is in there and, when you use predict, it is all handled automatically. There are no extra steps.
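
As a rough sketch of what that could look like, assuming the same mtcars workflow, a plain lm engine, and base R's saveRDS()/readRDS() (the file path and new_cars are hypothetical placeholders):

```r
library(tidymodels)

# Fit the workflow once, at training time
model_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ ., data = mtcars) %>%
      step_normalize(all_numeric_predictors(), id = "norm")
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Save the whole fitted workflow (recipe + model) to disk.
# The path is a placeholder; use whatever location suits your setup.
path <- tempfile(fileext = ".rds")
saveRDS(model_fit, path)

# ... months later, in a new R session with tidymodels loaded ...
reloaded <- readRDS(path)

# Hypothetical future inputs: predictors only, no outcome column
new_cars <- mtcars[1:3, setdiff(names(mtcars), "mpg")]

# predict() applies the stored preprocessing before calling the model
preds <- predict(reloaded, new_data = new_cars)
preds
```

This sketch assumes the fitted model is a plain R object, as with lm. Some engines keep references to objects outside of R, and for those saveRDS() alone may not be enough; the bundle package is aimed at that situation.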