Modifying Vetiver API input

brndngrhm · December 30, 2024, 2:31pm

I have a vetiver model API published on Posit Connect. Recently one of the ID variables in the dataset (not used for prediction) had its data type in our database change from numeric to character, and now the model API returns an error since the ID variable is the wrong type. I cannot convert it to numeric b/c it is composed of numbers and letters.

Is it possible to modify the vetiver API input to avoid retraining and re-publishing the model?

julia · January 6, 2025, 5:46pm

Can you share a reprex showing how your model was created in terms of roles and then what changed? I know you likely can't use your real data, but can you create a similar model with something like the biomass data in terms of roles? I am having a hard time understanding what might be going on here.

Currently, I have two bits of info to share that might be helpful.

Take a look at how roles are handled and updated in recipes. You may be particularly interested in bake = FALSE for ID variables, but making that change would involve retraining the model AFAIK.
Check out the possibility of using a custom "prototype" for your vetiver model, when you set up your deployable model bundle via vetiver_model(). You can specify exactly what columns you want to be checked and can choose not to include the ID variable here.
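A minimal sketch of the custom prototype idea, assuming a vetiver version where the save_prototype argument of vetiver_model() accepts a data frame. The formula-based workflow, the model name "biomass", and the custom_prototype object are illustrative stand-ins, not the poster's deployed model:

```r
library(tidymodels)
library(vetiver)

train <- modeldata::biomass %>% filter(dataset == "Training")

# Stand-in model: a workflow that only ever sees the five predictors
wf_fit <-
  workflow() %>%
  add_formula(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

# Hand-written zero-row prototype (hypothetical name): list only the
# columns the API should check, and leave any ID column out
custom_prototype <- tibble(
  carbon   = numeric(),
  hydrogen = numeric(),
  oxygen   = numeric(),
  nitrogen = numeric(),
  sulfur   = numeric()
)

v <- vetiver_model(wf_fit, "biomass", save_prototype = custom_prototype)
v$prototype
```

The prototype only controls what vetiver checks and converts at prediction time. A workflow built from a recipe may still require its own non-predictor columns, so this is one piece of the setup rather than a full fix.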

brndngrhm · January 6, 2025, 7:39pm

thanks! i created a reprex below and in doing so think i found a solution? but also in the process generated 2 new questions.

So in our database, not only did the data type change but the column name changed as well (e.g. cust_id -> customer_id). I was worried that the new column, customer_id, would either break the model or, even worse, somehow be included as a predictor in the model (since it didn't exist at training time and thus wasn't explicitly made an ID variable in the recipe), so to preempt any issues I was trying to rename it back to the expected column name, cust_id, and it was here i was running into my data type issue.

But in the reprex below, it seems I can create the expected column (cust_id) with all values set to 0, and the addition of the new column, customer_id, doesn't seem to impact the model.

So i guess i have 2 new questions after making this reprex:

Interestingly, the ID vars are not returned when printing the vetiver prototype, why is this?
Is the model just ignoring the new customer_id column?

library(tidymodels)
library(tidyverse)
library(vetiver)
library(pins)
library(ids)

biomass <- modeldata::biomass

# create IDs

# old ID was abbreviated and numeric
old_ids <- tibble(cust_id = seq(111111, 111111 + biomass %>% filter(dataset == "Training") %>% nrow()-1, 1))
head(old_ids)
#> # A tibble: 6 × 1
#>   cust_id
#>     <dbl>
#> 1  111111
#> 2  111112
#> 3  111113
#> 4  111114
#> 5  111115
#> 6  111116

# new one is spelled out and varchar
new_ids <- tibble(customer_id = ids::random_id(n = biomass %>% filter(dataset == "Testing") %>% nrow(), bytes = 3))
head(new_ids)
#> # A tibble: 6 × 1
#>   customer_id
#>   <chr>      
#> 1 277a90     
#> 2 b28e56     
#> 3 e152c7     
#> 4 afc1a1     
#> 5 14b08f     
#> 6 18d44e

# add Ids to train/test
biomass_old <- 
  biomass %>%
  filter(dataset == "Training") %>%
  bind_cols(., old_ids)

biomass_new <- 
  biomass %>%
  filter(dataset == "Testing") %>%
  bind_cols(., new_ids)

# modelling stuff
recipe <- 
  recipe(HHV ~ ., data = biomass_old) %>%
  update_role(sample, cust_id, dataset, new_role = "ID Variable") %>% 
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL)

biomass_wf <-  
  workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(lm_spec)

wflow_fit <- fit(biomass_wf, data = biomass_old)

wflow_fit %>% tidy()
#> # A tibble: 6 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)  19.2       0.0699   274.    0        
#> 2 carbon        3.64      0.0964    37.8   9.80e-142
#> 3 hydrogen      0.264     0.0855     3.08  2.18e-  3
#> 4 oxygen        0.139     0.113      1.23  2.19e-  1
#> 5 nitrogen     -0.0310    0.0794    -0.390 6.97e-  1
#> 6 sulfur        0.265     0.0761     3.48  5.50e-  4

# pin model to board
b <- pins::board_temp()
v <- vetiver_model(wflow_fit, "biomass", save_prototype = TRUE)

# note the ID vars are missing from the prototype? confirmed this also occurs with my real world deployed model
v$prototype
#> # A tibble: 0 × 5
#> # ℹ 5 variables: carbon <dbl>, hydrogen <dbl>, oxygen <dbl>, nitrogen <dbl>,
#> #   sulfur <dbl>

# trying to predict with the new data gives a 'required columns are missing' error
augment(v, new_data = biomass_new)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'cust_id'.

# trying to rename to the old ID name gives a datatype mismatch error
augment(v, new_data = biomass_new %>% rename(cust_id = customer_id))
#> Error:
#> ! Can't convert `data$cust_id` <character> to match type of `cust_id` <double>.

# I think this works?
augment(v, new_data = biomass_new %>% mutate(cust_id = 0))
#> # A tibble: 80 × 12
#>    .pred   .resid sample    dataset carbon hydrogen oxygen nitrogen sulfur   HHV
#>    <dbl>    <dbl> <chr>     <chr>    <dbl>    <dbl>  <dbl>    <dbl>  <dbl> <dbl>
#>  1  18.7 -0.391   Almond S… Testing   46.4     5.67   47.2     0.3    0.22  18.3
#>  2  17.6 -0.00192 Almond T… Testing   43.2     5.5    48.1     2.85   0.34  17.6
#>  3  17.4 -0.197   Animal W… Testing   42.7     5.5    49.1     2.4    0.3   17.2
#>  4  18.8  0.0688  Asparagu… Testing   46.4     6.1    37.3     1.8    0.5   18.9
#>  5  19.5  1.08    Bamboo W… Testing   48.8     6.32   42.8     0.2    0     20.5
#>  6  17.8  0.648   Barley S… Testing   44.3     5.5    41.7     0.7    0.2   18.5
#>  7  16.2 -1.12    Beet Roo… Testing   38.9     5.23   54.1     1.19   0.51  15.1
#>  8  16.8 -0.516   Bio-Dry … Testing   42.1     4.66   33.8     0.95   0.2   16.2
#>  9  15.0 -3.86    Black Li… Testing   29.2     4.4    31.1     0.14   4.9   11.1
#> 10  11.8 -1.09    Brown Ke… Testing   27.8     3.77   23.7     4.63   1.05  10.8
#> # ℹ 70 more rows
#> # ℹ 2 more variables: customer_id <chr>, cust_id <dbl>
Created on 2025-01-06 with reprex v2.1.0
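
The placeholder workaround at the end of the reprex could be wrapped in a small helper so every caller patches new data the same way. This is only a sketch: add_placeholder_id() is a hypothetical name, and it assumes the ID column has a non-predictor role, so the placeholder value never reaches the fitted model:

```r
# Hypothetical helper: add (or overwrite) the numeric ID column that the
# trained workflow still expects, leaving every other column untouched
add_placeholder_id <- function(new_data, id_col = "cust_id", placeholder = 0) {
  stopifnot(is.data.frame(new_data))
  new_data[[id_col]] <- rep(placeholder, nrow(new_data))
  new_data
}

# Usage with made-up IDs in the new character format
new_rows <- data.frame(customer_id = c("277a90", "b28e56"))
patched <- add_placeholder_id(new_rows)
str(patched)
```

The real customer_id column is kept alongside the placeholder, so predictions can still be joined back to the actual IDs afterwards.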

julia · January 8, 2025, 8:44pm

I'm glad that you are finding a solution that lets you move forward!

The vetiver prototype for tidymodels objects only includes the predictors, and then that is what is used to convert/check new data at prediction time.
The model itself (by which I mean what gets passed to lm() in your example) does not see the customer_id column; it only gets passed columns that are predictors. You can see that when you tidy() the fitted model. The tidymodels workflow, on the other hand, is expecting to see all the columns that it was trained with, which includes the ID variables. If you don't want it to require those when it comes time to predict, then you'll want to check out this advice around how to use bake = FALSE and whether to include that column at all.
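
For a future retrain, the bake = FALSE advice might look roughly like the sketch below. It assumes a recent recipes release that provides update_role_requirements(), plus workflows and hardhat versions that honor it; the object names and the generated IDs are illustrative, and this does not change an already-deployed model:

```r
library(tidymodels)

biomass <- modeldata::biomass

# Training data with the old numeric ID (made-up values)
train <- biomass %>%
  filter(dataset == "Training") %>%
  mutate(cust_id = 111110 + row_number())

rec <-
  recipe(HHV ~ ., data = train) %>%
  update_role(sample, cust_id, dataset, new_role = "ID Variable") %>%
  # Do not require "ID Variable" columns when baking new data
  update_role_requirements(role = "ID Variable", bake = FALSE) %>%
  step_normalize(all_numeric_predictors())

wf_fit <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

# New data: no cust_id at all, plus a renamed character ID column
test <- biomass %>%
  filter(dataset == "Testing") %>%
  mutate(customer_id = as.character(row_number()))

preds <- predict(wf_fit, new_data = test)
head(preds)
```

With the requirement relaxed, later changes to the ID column's name or type should no longer block prediction, because the workflow stops asking for that column.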

brndngrhm · January 9, 2025, 4:18pm

thanks! will def keep bake = FALSE in mind in the future.
