Saving a model after tuning?

cwright1 · January 31, 2023, 10:44pm

My model was tuned and seems to work well, but now I'm not sure how to save it and re-apply it.

Minimal reprex (made following this tutorial to generate and tune the model, and this article on saving the model with {yaml} and {tidypredict}, using the iris data):

library(tidymodels)
library(tidypredict)
library(yaml)

tidymodels_prefer()

#Initial split, generate training and testing
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)
training_set <- training(mysplit)
test_set <- testing(mysplit)


#Set up the model specification
#The hyperparameters will be tuned
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),                    
  sample_size = tune(),
  mtry = tune(),   
  learn_rate = tune()                          
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")


#Set up a space-filling grid design to cover the hyperparameter space as well as possible
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), training_set), #gets treated differently b/c it depends on actual # of predictors in data
  learn_rate(),
  size = 30
)

#Put the model specification into a workflow
xgb_wf <- workflow() %>%
  add_formula(Petal.Width ~.) %>% 
  add_model(xgb_spec)


#Create cross-validation resamples for tuning the model
input_folds <- vfold_cv(training_set, strata=Petal.Width)


#Use tunable workflow to tune
doParallel::registerDoParallel()
xgb_res <- tune_grid(
  xgb_wf,
  resamples = input_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)


#Select the best parameters based on RMSE
best_rmse <- select_best(xgb_res, metric = "rmse")


#Finalize the tuneable workflow using the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)

#############
#Fit the final best model to training set and evaluate the test set
final_res <- last_fit(final_xgb, mysplit)
#############


#Get the model-predicted values of the test set
pred_df <- 
  final_res %>%
  collect_predictions() %>%
  as.data.frame()

But now I'm a little confused about how to save this model so it can be re-used. (I know there would normally be an extra re-training step on the entire dataset, which I'm skipping here.)

parsed <- parse_model(extract_fit_engine(final_res)) #Is this right?
write_yaml(parsed, "my_model.yml")
loaded_model <- read_yaml("my_model.yml")

loaded_model <- as_parsed_model(loaded_model)

If this is correct, how would I fit it on the test set again? The same test set is fine as a toy. I thought it would be like this, but no luck:

loaded_model %>% fit(Petal.Width ~., data = iris)

Error in UseMethod("fit") :
no applicable method for 'fit' applied to an object of class "list"
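
The error is consistent with how tidypredict works: the object read back from YAML is a plain list describing the model, not a fitted parsnip or workflow object, so there is no `fit()` method for it. A parsed model is instead turned into an R expression with `tidypredict_fit()` and evaluated against the data. A minimal sketch follows, with `lm()` standing in for the tuned xgboost model (an assumption made to keep the example small; the names `lm_model`, `yaml_path`, and `pred_expr` are illustrative only):

```r
library(tidypredict)
library(yaml)
library(dplyr)

# Stand-in model (assumption): a linear model instead of the tuned xgboost fit.
lm_model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)

# Parse the fitted model, write it to YAML, and read it back in.
yaml_path <- tempfile(fileext = ".yml")
write_yaml(parse_model(lm_model), yaml_path)
loaded_model <- as_parsed_model(read_yaml(yaml_path))

# The reloaded object yields a prediction expression; it is not re-fitted.
pred_expr <- tidypredict_fit(loaded_model)

# Evaluate the expression against the data to get predictions.
scored <- iris %>% mutate(.pred = !!pred_expr)
head(scored$.pred)
```

Because the coefficients pass through a text file, the predictions may differ from `predict(lm_model, iris)` by a small rounding error.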

williaml · January 31, 2023, 11:54pm

You could use the bundle package:

Serialize Model Objects with a Consistent Interface • bundle (rstudio.github.io)

Max · January 31, 2023, 11:55pm

@williaml beat me to it!

Here's a specific link for xgboost objects.

cwright1 · February 1, 2023, 2:26am

Hi @Max ,

I found that one thing causing my deployment not to work was the butcher step. Maybe I overlooked some documentation that mentions this not working with xgboost objects? After removing it, things worked (I think).

In my reprex below, near the end is how I got the model to save and predict. Does this look right? I am not sure if I'm supposed to be doing the 'fit' stage again, then predict.

library(tidymodels)
library(tidypredict)
library(yaml)
library(bundle) #needed for bundle() and unbundle() below

tidymodels_prefer()

#Initial split, generate training and testing
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)
training_set <- training(mysplit)
test_set <- testing(mysplit)


#Set up the model specification
#The hyperparameters will be tuned
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),                    
  sample_size = tune(),
  mtry = tune(),   
  learn_rate = tune()                          
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")


#Set up a space-filling grid design to cover the hyperparameter space as well as possible
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), training_set), #gets treated differently b/c it depends on actual # of predictors in data
  learn_rate(),
  size = 30
)

#Put the model specification into a workflow
xgb_wf <- workflow() %>%
  add_formula(Petal.Width ~.) %>% 
  add_model(xgb_spec)


#Create cross-validation resamples for tuning the model
input_folds <- vfold_cv(training_set, strata=Petal.Width)


#Use tunable workflow to tune
doParallel::registerDoParallel()
xgb_res <- tune_grid(
  xgb_wf,
  resamples = input_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)


#Select the best parameters based on RMSE
best_rmse <- select_best(xgb_res, metric = "rmse")


#Finalize the tuneable workflow using the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)

#############
#Fit the final best model to training set and evaluate the test set
final_res <- last_fit(final_xgb, mysplit)
#############


#Get the model-predicted values of the test set
pred_df <- 
  final_res %>%
  collect_predictions() %>%
  as.data.frame() 




#Save Model
res_bundle <-
  final_xgb %>% #It shouldn't be extract_fit_engine(final_res) ?
  bundle()


#Save RDS, remove it, read it back in
saveRDS(res_bundle, file="res_bundle.RDS")
rm(res_bundle)
res_bundle <- readRDS("res_bundle.RDS")

#unbundle
xgb_unbundled <- unbundle(res_bundle)



testfit <- xgb_unbundled %>% fit( data = iris) #Have to fit it again?


my_prediction <- as.data.frame(predict(testfit, 
                                       iris))

cwright1 · February 1, 2023, 2:27am

Thanks! I'm not sure whether I am doing it correctly now. I posted the code I got working in response to Max below, but I had to do the 'fit' again.

williaml · February 1, 2023, 2:49am

Looks alright to me:

augment(testfit, new_data = iris) %>% 
  ggplot(aes(Petal.Width, .pred, colour = Species)) +
  geom_point()

Max · February 1, 2023, 4:36am

No. If you started with a workflow, keep it as a workflow. If you have any factor-type predictors, the workflow would split them into different indicator columns (which, for now, xgboost requires).
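
As a rough illustration of what "indicator columns" means here, base R's `model.matrix()` can expand a factor in a similar way. This is only a stand-in for the workflow's own preprocessing, whose exact encoding may differ, and `indicators` is an illustrative name:

```r
# Sketch only: model.matrix() stands in for the workflow's formula preprocessing.
# "- 1" drops the intercept so that every factor level gets its own 0/1 column.
indicators <- model.matrix(~ Species - 1, data = iris)

colnames(indicators)
head(indicators, 3)
```

A fitted workflow carries this preprocessing with it, which is why saving the workflow rather than the bare engine fit lets `predict()` accept data in its original form.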

I think the issue is that you were exporting the wrong thing. See below for two lines near the end that I changed:

library(tidymodels)
library(tidypredict)
library(yaml)
library(butcher)
library(bundle)

tidymodels_prefer()

#Initial split, generate training and testing
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)
training_set <- training(mysplit)
test_set <- testing(mysplit)


#Set up the model specification
#The hyperparameters will be tuned
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),                    
  sample_size = tune(),
  mtry = tune(),   
  learn_rate = tune()                          
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")


#Set up a space-filling grid design to cover the hyperparameter space as well as possible
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), training_set), #gets treated differently b/c it depends on actual # of predictors in data
  learn_rate(),
  size = 30
)

#Put the model specification into a workflow
xgb_wf <- workflow() %>%
  add_formula(Petal.Width ~.) %>% 
  add_model(xgb_spec)


#Create cross-validation resamples for tuning the model
input_folds <- vfold_cv(training_set, strata=Petal.Width)


#Use tunable workflow to tune
doParallel::registerDoParallel()
xgb_res <- tune_grid(
  xgb_wf,
  resamples = input_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)


#Select the best parameters based on RMSE
best_rmse <- select_best(xgb_res, metric = "rmse")


#Finalize the tuneable workflow using the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)

#############
#Fit the final best model to training set and evaluate the test set
final_res <- last_fit(final_xgb, mysplit)
#############


#Get the model-predicted values of the test set
pred_df <- 
  final_res %>%
  collect_predictions() %>%
  as.data.frame() 

#Save Model
res_bundle <-
  final_res %>%            #<- changed
  extract_workflow() %>%   #<- changed
  butcher() %>% 
  bundle()

#Save as RDS (it is read back in after a restart, below)
saveRDS(res_bundle, file="~/tmp/res_bundle.RDS")

After a restart, that worked for me:

library(tidymodels)
library(bundle)

tidymodels_prefer()


res_bundle <- readRDS("~/tmp/res_bundle.RDS")

#unbundle
xgb_unbundled <- unbundle(res_bundle)

predict(xgb_unbundled, head(iris))
#> # A tibble: 6 × 1
#>   .pred
#>   <dbl>
#> 1 0.291
#> 2 0.208
#> 3 0.209
#> 4 0.181
#> 5 0.249
#> 6 0.326

Created on 2023-01-31 by the reprex package (v2.0.1)
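
The key point in this exchange is that a fitted object, once saved and reloaded, predicts directly with no second `fit()` call. A minimal base-R sketch of that round trip follows, with `lm()` standing in for the tuned xgboost workflow (an assumption made so the example runs without tidymodels). Plain `saveRDS()` is enough for `lm`, whereas `bundle()` is aimed at models whose internals may not survive plain serialization. The names `fitted_model`, `model_path`, and `reloaded_model` are illustrative only:

```r
# Stand-in for the fitted workflow (assumption): an ordinary linear model.
fitted_model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)

# Keep reference predictions, then save the fitted object and drop it.
expected <- predict(fitted_model, head(iris))
model_path <- tempfile(fileext = ".rds")
saveRDS(fitted_model, model_path)
rm(fitted_model)

# Read it back in and predict straight away; nothing is re-fitted.
reloaded_model <- readRDS(model_path)
reloaded_pred <- predict(reloaded_model, head(iris))

# TRUE when the reloaded model reproduces the original predictions.
print(isTRUE(all.equal(expected, reloaded_pred)))
```

By contrast, calling `fit()` on an unfitted specification after reloading trains a new model on whatever data is supplied, so the tuned fit itself was never saved.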
