Score new data with workflows and vetiver

john.smith · October 24, 2022, 7:24am

Hi,

I am trying to implement an ML scoring model using vetiver.
Basically, I have one script that trains a model and saves it to a model folder, and a different script that picks the model up to screen (score) the raw data.

Below is some pseudocode:

library(vetiver)
library(tidymodels)
library(pins)

### Some Code for Data manipulation using recipes

# Here I fit my final XGB workflow, built from the recipe I created and the tuned parameter values
mod_final <- final_xgb %>% 
  fit(mydf)

# Then I wrap the fitted model as a versioned vetiver object, to be saved to a folder called 'model'
v <- mod_final %>% 
  vetiver_model(model_name = "test-mod")
  
# I write the vetiver object to a network drive
model_board <- board_folder(path = here::here('model'))
model_board %>% vetiver_pin_write(v)

Now I have a folder that contains the vetiver object at model/20221024T132328Z-8fbb5/test-mod.rds

The second part of the project involves scoring the data in a completely different script and in a completely different environment.
In general, without using vetiver, I know I can save the workflow as an RDS file, and when I read it back in and apply it to new data it will apply the transformations and then score the data.

I am not able to work out how to do the same with vetiver:

library(vetiver)
library(tidyverse)  # enframe(), filter(), mutate(), str_detect(), str_c()

# Pull in our data to be screened from the DB..
mydf <- read_from_db()

# Now we pull in the vetiver object. Since each training run writes a new version of the model,
# we list the version folders, keep the most recently created one, and import the model from it.
all_paths <- list.dirs(path = here::here("model")) %>% 
  enframe() %>% 
  filter(str_detect(value, '[0-9]')) %>% 
  mutate(modified_date = file.info(.$value)$ctime) %>% 
  filter(modified_date == max(modified_date))

mod_path <- str_c(all_paths$value, "test-mod.rds", sep = '/')
eu_wf_model <- readRDS(mod_path)

# Pull the new data from the database
score_df <- pull_daily_data()

# Here we score the new data, and this is where it breaks
report <- score_df %>% 
  bind_cols(predict(eu_wf_model, score_df, type = "prob"))

It breaks on the last line with the error:

Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list"

This is because the object I get back from readRDS() is a plain list with the elements $model, $raw, $ptype and $required_pkgs, and predict() has no method for a list.

My question is: Is there a way to use vetiver to apply a workflow to the new data?

Thank you for your time

julia · October 29, 2022, 6:33pm

Yes, you'll want to make a pins board in the new environment which also has access to where you originally stored the model, something like:

library(pins)
board <- board_folder(path = here::here('model'))

And then you can read the model:

library(vetiver)
eu_wf_model <- board %>% vetiver_pin_read("test-mod")

That model will be an object you can predict with:

predict(eu_wf_model, score_df, type = "prob")
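
To try the write/read round trip without the original data, here is a minimal sketch. It assumes a plain lm() fit on mtcars in place of the xgboost workflow and a temporary board from board_temp() in place of the shared 'model' folder; both are stand-ins for illustration, not what the question uses.

```r
library(vetiver)
library(pins)

# Stand-in model (assumption): a simple linear model instead of the xgboost workflow
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Stand-in board (assumption): a temporary board instead of board_folder(here::here("model"))
board <- board_temp(versioned = TRUE)

# Training script: wrap the fitted model and write it to the board
v <- vetiver_model(fit, model_name = "test-mod")
vetiver_pin_write(board, v)

# Scoring script: read the pin back through vetiver rather than readRDS()
v_read <- vetiver_pin_read(board, "test-mod")

# The object read back is a vetiver_model, so predict() dispatches correctly
preds <- predict(v_read, mtcars[1:5, ])
class(v_read)
```

The key difference from the readRDS() approach is that vetiver_pin_read() rebuilds a vetiver_model object, whereas reading the .rds file directly only returns the raw list that was stored.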

Compared to storing a bare model object as an .rds file on a shared drive, the benefits of vetiver are automatic versioning (you can specify the version to use, or you will get the most recent by default), automatic careful checking of the input data types, and automatic handling of the references needed for prediction (via bundle). I see that you are using an xgboost model, which can be especially prone to failures in those areas.
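
As a rough sketch of the versioning behaviour, the snippet below writes two versions of the same pin and then reads either the latest or a specific one. The lm() models on mtcars and the temporary board are assumptions made only to keep the example self-contained; with a real board_folder() the same calls should apply.

```r
library(vetiver)
library(pins)

# Stand-in board (assumption): temporary and versioned
board <- board_temp(versioned = TRUE)

# First "training run"
fit_v1 <- lm(mpg ~ wt, data = mtcars)
vetiver_pin_write(board, vetiver_model(fit_v1, model_name = "test-mod"))

# Pause so the two versions get distinct timestamps
Sys.sleep(1.1)

# Second "training run" with a different formula
fit_v2 <- lm(mpg ~ wt + hp, data = mtcars)
vetiver_pin_write(board, vetiver_model(fit_v2, model_name = "test-mod"))

# List the stored versions (a tibble with version, created and hash columns)
versions <- pin_versions(board, "test-mod")

# No version given: the most recent version is read
latest <- vetiver_pin_read(board, "test-mod")

# Explicit version: here, the one with the earliest creation time
oldest_id <- versions$version[which.min(versions$created)]
older <- vetiver_pin_read(board, "test-mod", version = oldest_id)

new_rows <- mtcars[1:3, ]
pred_latest <- predict(latest, new_rows)
pred_older <- predict(older, new_rows)
```

This replaces the manual list.dirs() and file.info() search for the newest folder: the board keeps track of versions, so the scoring script only needs the pin name.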

john.smith · November 4, 2022, 11:18am

Hi @julia,

Thank you very much. This works perfectly for me now.

Thanks
