Thanks! I created a reprex below and in doing so I think I found a solution, but the process also raised 2 new questions.
So in our database, not only did the data type change but the column name changed as well (e.g. cust_id -> customer_id). I was worried that the new column, customer_id, would either break the model or, even worse, somehow be included as a predictor (since it didn't exist at training time and thus wasn't explicitly given an ID role in the recipe). To preempt any issues I was trying to rename it back to the expected column name, cust_id, and that is where I was running into my data type issue.
But in the reprex below, it seems I can create the expected column (cust_id) with all values set to 0, and the addition of the new column, customer_id, doesn't seem to impact the model.
So I guess I have 2 new questions after making this reprex:
- Interestingly, the ID variables are not returned when printing the vetiver prototype. Why is this?
- Is the model just ignoring the new customer_id column?
library(tidymodels)
library(tidyverse)
library(vetiver)
library(pins)
library(ids)
biomass <- modeldata::biomass
# create IDs
# old ID was abbreviated and numeric
old_ids <- tibble(cust_id = seq(111111, 111111 + biomass %>% filter(dataset == "Training") %>% nrow()-1, 1))
head(old_ids)
#> # A tibble: 6 × 1
#> cust_id
#> <dbl>
#> 1 111111
#> 2 111112
#> 3 111113
#> 4 111114
#> 5 111115
#> 6 111116
# new one is spelled out and varchar
new_ids <- tibble(customer_id = ids::random_id(n = biomass %>% filter(dataset == "Testing") %>% nrow(), bytes = 3))
head(new_ids)
#> # A tibble: 6 × 1
#> customer_id
#> <chr>
#> 1 277a90
#> 2 b28e56
#> 3 e152c7
#> 4 afc1a1
#> 5 14b08f
#> 6 18d44e
# add IDs to train/test
biomass_old <-
  biomass %>%
  filter(dataset == "Training") %>%
  bind_cols(., old_ids)
biomass_new <-
  biomass %>%
  filter(dataset == "Testing") %>%
  bind_cols(., new_ids)
# modelling stuff
recipe <-
  recipe(HHV ~ ., data = biomass_old) %>%
  update_role(sample, cust_id, dataset, new_role = "ID Variable") %>%
  step_normalize(all_numeric_predictors())
lm_spec <- linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL)
biomass_wf <-
  workflow() %>%
  add_recipe(recipe) %>%
  add_model(lm_spec)
wflow_fit <- fit(biomass_wf, data = biomass_old)
wflow_fit %>% tidy()
#> # A tibble: 6 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 19.2 0.0699 274. 0
#> 2 carbon 3.64 0.0964 37.8 9.80e-142
#> 3 hydrogen 0.264 0.0855 3.08 2.18e- 3
#> 4 oxygen 0.139 0.113 1.23 2.19e- 1
#> 5 nitrogen -0.0310 0.0794 -0.390 6.97e- 1
#> 6 sulfur 0.265 0.0761 3.48 5.50e- 4
# pin model to board
b <- pins::board_temp()
v <- vetiver_model(wflow_fit, "biomass", save_prototype = TRUE)
# note the ID vars are missing from the prototype? confirmed this also occurs with my real-world deployed model
v$prototype
#> # A tibble: 0 × 5
#> # ℹ 5 variables: carbon <dbl>, hydrogen <dbl>, oxygen <dbl>, nitrogen <dbl>,
#> # sulfur <dbl>
# trying to predict with the new data gives a 'required columns are missing' error
augment(v, new_data = biomass_new)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'cust_id'.
# trying to rename to the old ID name gives a datatype mismatch error
augment(v, new_data = biomass_new %>% rename(cust_id = customer_id))
#> Error:
#> ! Can't convert `data$cust_id` <character> to match type of `cust_id` <double>.
# I think this works?
augment(v, new_data = biomass_new %>% mutate(cust_id = 0))
#> # A tibble: 80 × 12
#> .pred .resid sample dataset carbon hydrogen oxygen nitrogen sulfur HHV
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 18.7 -0.391 Almond S… Testing 46.4 5.67 47.2 0.3 0.22 18.3
#> 2 17.6 -0.00192 Almond T… Testing 43.2 5.5 48.1 2.85 0.34 17.6
#> 3 17.4 -0.197 Animal W… Testing 42.7 5.5 49.1 2.4 0.3 17.2
#> 4 18.8 0.0688 Asparagu… Testing 46.4 6.1 37.3 1.8 0.5 18.9
#> 5 19.5 1.08 Bamboo W… Testing 48.8 6.32 42.8 0.2 0 20.5
#> 6 17.8 0.648 Barley S… Testing 44.3 5.5 41.7 0.7 0.2 18.5
#> 7 16.2 -1.12 Beet Roo… Testing 38.9 5.23 54.1 1.19 0.51 15.1
#> 8 16.8 -0.516 Bio-Dry … Testing 42.1 4.66 33.8 0.95 0.2 16.2
#> 9 15.0 -3.86 Black Li… Testing 29.2 4.4 31.1 0.14 4.9 11.1
#> 10 11.8 -1.09 Brown Ke… Testing 27.8 3.77 23.7 4.63 1.05 10.8
#> # ℹ 70 more rows
#> # ℹ 2 more variables: customer_id <chr>, cust_id <dbl>
Created on 2025-01-06 with reprex v2.1.0
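One way I could probably check the second question directly, and avoid the dummy cust_id column altogether, is sketched below. This was not part of the reprex run above, so treat it as untested. It assumes a reasonably recent recipes release that provides update_role_requirements(), and it assumes that setting bake = FALSE for the "ID Variable" role is enough for the workflow to stop requiring those columns at prediction time. The customer_id values are made up for illustration.

```r
library(tidymodels)

biomass <- modeldata::biomass

# Training rows with the old numeric ID, as in the reprex above
biomass_old <- biomass %>%
  filter(dataset == "Training") %>%
  mutate(cust_id = 111111 + row_number() - 1)

# Testing rows with a character ID under the new name (made-up values)
biomass_new <- biomass %>%
  filter(dataset == "Testing") %>%
  mutate(customer_id = sprintf("id%04d", row_number()))

rec <- recipe(HHV ~ ., data = biomass_old) %>%
  update_role(sample, cust_id, dataset, new_role = "ID Variable") %>%
  # Assumption: this relaxes the check that ID columns exist in new data
  update_role_requirements("ID Variable", bake = FALSE) %>%
  step_normalize(all_numeric_predictors())

wflow_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = biomass_old)

# Role assigned to each original column by the recipe
role_summary <- summary(extract_preprocessor(wflow_fit))
role_summary

# Columns that actually reached the model as predictors
predictor_names <- names(extract_mold(wflow_fit)$predictors)
predictor_names

# biomass_new has no cust_id column; customer_id is present but has no role
preds <- predict(wflow_fit, new_data = biomass_new)
preds
```

If customer_id does not appear in predictor_names, that would suggest the model only uses the columns it saw at training time and simply ignores the new one, which matches what the augment() output above seems to show.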