Modifying Vetiver API input

I have a vetiver model API published on Posit Connect. Recently one of the ID variables in the dataset (not used for prediction) had its data type in our database change from numeric to character, and now the model API returns an error because the ID variable is the wrong type. I cannot convert it back to numeric because it is composed of numbers and letters.

Is it possible to modify the vetiver API input to avoid retraining and re-publishing the model?


Can you share a reprex showing how your model was created in terms of roles and then what changed? I know you likely can't use your real data, but can you create a similar model with something like the biomass data in terms of roles? I am having a hard time understanding what might be going on here.

Currently, I have two bits of info to share that might be helpful.

Thanks! I created a reprex below, and in doing so I think I found a solution, but the process also raised 2 new questions.

So in our database, not only did the data type change but the column name changed as well (e.g. cust_id -> customer_id). I was worried that the new column, customer_id, would either break the model or, even worse, somehow be included as a predictor in the model (since it didn't exist at training time and thus wasn't explicitly made an ID variable in the recipe). To preempt any issues I was trying to rename it back to the expected column name, cust_id, and it was here that I was running into my data type issue.

But in the reprex below, it seems I can create the expected column (cust_id) with all values set to 0, and the addition of the new column, customer_id, doesn't seem to impact the model.

So I guess I have 2 new questions after making this reprex:

  1. Interestingly, the ID vars are not returned when printing the vetiver prototype. Why is this?
  2. Is the model just ignoring the new customer_id column?
library(tidymodels)
library(tidyverse)
library(vetiver)
library(pins)
library(ids)

biomass <- modeldata::biomass

# create IDs

# old ID was abbreviated and numeric
old_ids <- tibble(cust_id = seq(111111, 111111 + biomass %>% filter(dataset == "Training") %>% nrow()-1, 1))
head(old_ids)
#> # A tibble: 6 × 1
#>   cust_id
#>     <dbl>
#> 1  111111
#> 2  111112
#> 3  111113
#> 4  111114
#> 5  111115
#> 6  111116

# new one is spelled out and varchar
new_ids <- tibble(customer_id = ids::random_id(n = biomass %>% filter(dataset == "Testing") %>% nrow(), bytes = 3))
head(new_ids)
#> # A tibble: 6 × 1
#>   customer_id
#>   <chr>      
#> 1 277a90     
#> 2 b28e56     
#> 3 e152c7     
#> 4 afc1a1     
#> 5 14b08f     
#> 6 18d44e

# add Ids to train/test
biomass_old <- 
  biomass %>%
  filter(dataset == "Training") %>%
  bind_cols(., old_ids)

biomass_new <- 
  biomass %>%
  filter(dataset == "Testing") %>%
  bind_cols(., new_ids)

# modelling stuff
recipe <- 
  recipe(HHV ~ ., data = biomass_old) %>%
  update_role(sample, cust_id, dataset, new_role = "ID Variable") %>% 
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL)

biomass_wf <-  
  workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(lm_spec)

wflow_fit <- fit(biomass_wf, data = biomass_old)

wflow_fit %>% tidy()
#> # A tibble: 6 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)  19.2       0.0699   274.    0        
#> 2 carbon        3.64      0.0964    37.8   9.80e-142
#> 3 hydrogen      0.264     0.0855     3.08  2.18e-  3
#> 4 oxygen        0.139     0.113      1.23  2.19e-  1
#> 5 nitrogen     -0.0310    0.0794    -0.390 6.97e-  1
#> 6 sulfur        0.265     0.0761     3.48  5.50e-  4

# pin model to board
b <- pins::board_temp()
v <- vetiver_model(wflow_fit, "biomass", save_prototype = TRUE)

# note the ID vars are missing from the prototype? confirmed this also occurs with my real-world deployed model
v$prototype
#> # A tibble: 0 × 5
#> # ℹ 5 variables: carbon <dbl>, hydrogen <dbl>, oxygen <dbl>, nitrogen <dbl>,
#> #   sulfur <dbl>

# trying to predict with the new data gives a 'required columns are missing' error
augment(v, new_data = biomass_new)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'cust_id'.

# trying to rename to the old ID name gives a datatype mismatch error
augment(v, new_data = biomass_new %>% rename(cust_id = customer_id))
#> Error:
#> ! Can't convert `data$cust_id` <character> to match type of `cust_id` <double>.

# I think this works?
augment(v, new_data = biomass_new %>% mutate(cust_id = 0))
#> # A tibble: 80 × 12
#>    .pred   .resid sample    dataset carbon hydrogen oxygen nitrogen sulfur   HHV
#>    <dbl>    <dbl> <chr>     <chr>    <dbl>    <dbl>  <dbl>    <dbl>  <dbl> <dbl>
#>  1  18.7 -0.391   Almond S… Testing   46.4     5.67   47.2     0.3    0.22  18.3
#>  2  17.6 -0.00192 Almond T… Testing   43.2     5.5    48.1     2.85   0.34  17.6
#>  3  17.4 -0.197   Animal W… Testing   42.7     5.5    49.1     2.4    0.3   17.2
#>  4  18.8  0.0688  Asparagu… Testing   46.4     6.1    37.3     1.8    0.5   18.9
#>  5  19.5  1.08    Bamboo W… Testing   48.8     6.32   42.8     0.2    0     20.5
#>  6  17.8  0.648   Barley S… Testing   44.3     5.5    41.7     0.7    0.2   18.5
#>  7  16.2 -1.12    Beet Roo… Testing   38.9     5.23   54.1     1.19   0.51  15.1
#>  8  16.8 -0.516   Bio-Dry … Testing   42.1     4.66   33.8     0.95   0.2   16.2
#>  9  15.0 -3.86    Black Li… Testing   29.2     4.4    31.1     0.14   4.9   11.1
#> 10  11.8 -1.09    Brown Ke… Testing   27.8     3.77   23.7     4.63   1.05  10.8
#> # ℹ 70 more rows
#> # ℹ 2 more variables: customer_id <chr>, cust_id <dbl>
Created on 2025-01-06 with reprex v2.1.0
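
The placeholder-column workaround from the last call can be wrapped in a small function that is applied to incoming data before it reaches the model. This is only a sketch: `patch_id_columns()` is a hypothetical helper name (not part of vetiver), the two-row input is made up, and it assumes a constant 0 is acceptable because cust_id is an ID variable and never a predictor.

```r
library(dplyr)

# Hypothetical helper: make sure the numeric `cust_id` column that the fitted
# workflow was trained with exists, without touching the new `customer_id`.
patch_id_columns <- function(new_data) {
  needs_placeholder <- !"cust_id" %in% names(new_data) ||
    !is.numeric(new_data[["cust_id"]])
  if (needs_placeholder) {
    # The value is arbitrary; the column only has to exist with a numeric type
    new_data <- mutate(new_data, cust_id = 0)
  }
  new_data
}

# Made-up rows standing in for the new database extract
incoming <- tibble(
  customer_id = c("277a90", "b28e56"),
  carbon = c(46.4, 43.2)
)
patched <- patch_id_columns(incoming)
```

With the reprex objects, this would be used as `augment(v, new_data = patch_id_columns(biomass_new))`. Data that already carries a numeric cust_id passes through unchanged.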

I'm glad that you are finding a solution that lets you move forward!

  • The vetiver prototype for tidymodels objects only includes the predictors, and then that is what is used to convert/check new data at prediction time.
  • The model itself (by which I mean what gets passed to lm() in your example) does not see the customer_id column; it is only passed the columns that are predictors. You can see that when you tidy() the fitted model. The tidymodels workflow, on the other hand, expects to see all the columns that it was trained with, which includes the ID variables. If you don't want it to require those at prediction time, then you'll want to check out this advice around how to use bake = FALSE (set through recipes::update_role_requirements()) and whether to include that column at all.
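
As a minimal sketch of the bake = FALSE advice, the recipe from the reprex could be declared so that columns with the "ID Variable" role are not required at prediction time. This assumes a recipes release that provides `update_role_requirements()` and a compatible workflows/hardhat; the sequential cust_id and customer_id values are made up for illustration. Note that this applies when fitting a new model; it does not change a model that is already deployed.

```r
library(recipes)
library(parsnip)
library(workflows)

biomass <- modeldata::biomass

# Training data with the old numeric ID column
train <- biomass[biomass$dataset == "Training", ]
train$cust_id <- seq_len(nrow(train))

rec <- recipe(HHV ~ ., data = train) |>
  update_role(sample, cust_id, dataset, new_role = "ID Variable") |>
  # Do not require "ID Variable" columns when baking/predicting
  update_role_requirements(role = "ID Variable", bake = FALSE) |>
  step_normalize(all_numeric_predictors())

wf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg()) |>
  fit(data = train)

# New data: cust_id is absent, customer_id is a new character column
test <- biomass[biomass$dataset == "Testing", ]
test$customer_id <- as.character(seq_len(nrow(test)))

preds <- predict(wf_fit, new_data = test)
```

Here the prediction call should no longer complain about a missing cust_id, and the extra customer_id column is ignored because it has no role in the recipe.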

thanks! will def keep bake = FALSE in mind in the future.
