Serializing workflows with XGBoost models

vadimus · August 4, 2022, 4:02pm

The xgboost documentation cautions against using saveRDS in favor of xgb.save, for storing trained models for future scoring. Introduction to Model IO — xgboost 1.6.1 documentation

We guarantee backward compatibility for models but not for memory snapshots.

Models (trees and objective) use a stable representation, so that models produced in earlier versions of XGBoost are accessible in later versions of XGBoost. If you’d like to store or archive your model for long-term storage, use save_model (Python) and xgb.save (R).

On the other hand, memory snapshot (serialisation) captures many stuff internal to XGBoost, and its format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for checkpointing only, where you persist the complete snapshot of the training configurations so that you can recover robustly from possible failures and resume the training process. Loading memory snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors. If a model is persisted with pickle.dump (Python) or saveRDS (R), then the model may not be accessible in later versions of XGBoost.

Does it mean we should avoid serializing tydymodels workflow objects as well, when there is a possibility of an upgraded version of xgboost being used during scoring in the future?

simoncouch · August 4, 2022, 5:59pm

Thanks for the post, @vadimus!

Does it mean we should avoid serializing tidymodels workflow objects as well, when there is a possibility of an upgraded version of xgboost being used during scoring in the future?

For now, xgboost models fitted with parsnip (and possible wrappers like workflows, workflowsets, stacks) will be susceptible to the kind of instability across versions noted in the documentation you've linked to.

We're actively working on better infrastructure for supporting native serialization like xgboost's xgb.save(). That work currently lives at GitHub - rstudio/bundle: Prepare objects for serialization with a consistent interface if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects outputted by tidymodels / vetiver soon. I'd anticipate this work to reach our CRAN packages before the end of the year.

EDIT: The bundle package is now on CRAN, and supports serializing tidymodels workflows fitted with xgboost. Read more: Announcing bundle

vadimus · August 10, 2022, 11:41am

That's great. Thanks for the update.

system · August 31, 2022, 11:42am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.