Do the preprocessing steps from a recipe that is bundled into a tidymodels workflow get applied to the request body of a plumber API? Based on my testing and the reprex below, I believe the answer is no, but I wanted to confirm that what I am seeing is expected behavior. My request body inputs may be null/missing at times for a variety of reasons, and I was hoping the recipe portion of the workflow would correct this on the fly in the plumber API.
If this is expected behavior, what are common recommendations for imputing null/missing values in API requests so that the model returns a prediction rather than a 500 error when a value is missing? Is the fix as simple as converting the request body to a data frame and baking it with the recipe from the modeling workflow before passing it to predict()?
# Load Libraries ----------------------------------------------------------
library(plumber)
library(tidymodels)
library(tidyverse)
# Construct Basic Model ---------------------------------------------------
# Load and split data
df = mtcars
train_df = df[1:25, ]
test_df = df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA
# Define Recipe
mod_rec = recipe(mpg ~ cyl + disp + hp, data = train_df) %>%
  step_impute_median(all_numeric_predictors())
prep(mod_rec, verbose = TRUE)
#> oper 1 step impute median [training]
#> The retained training set is ~ 0 Mb in memory.
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 3
#>
#> ── Training information
#> Training data contained 25 data points and 5 incomplete rows.
#>
#> ── Operations
#> • Median imputation for: cyl, disp, hp | Trained
# Define Model
tree_mod = decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")
# Define Workflow
tree_wkflow = workflow() %>%
  add_recipe(mod_rec) %>%
  add_model(tree_mod)
# Fit Model
mod1 = fit(tree_wkflow, train_df)
saveRDS(mod1, file = "cars.rds")
# API ---------------------------------------------------------------------
trained_mod = readRDS("cars.rds")
#* How many mpg should we expect?
#* @post /predict_mpg
function(req, res) {
  predict(trained_mod, new_data = as.data.frame(req$body))
}
# Update UI
#* @plumber
function(pr) {
  pr %>% pr_set_api_spec(yaml::read_yaml("cars_yml.yml"))
}
I failed to include this in the original message, but here are the yaml file and a screenshot of the API when I try to pass it a missing value.
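After more digging, I think the 500 may come from how the body is parsed rather than from the recipe itself. This is my own sketch of what I believe is happening, not something I have confirmed in plumber's internals: a JSON null seems to arrive in R as NULL, and as.data.frame() silently drops NULL list elements, so the column never reaches predict() at all.

```r
# A JSON body like {"cyl": 6, "disp": null, "hp": 110} is parsed into a list
# whose null element is NULL; as.data.frame() drops NULL elements silently,
# so predict() later fails because the disp predictor column is absent.
body <- list(cyl = 6, disp = NULL, hp = 110)
df <- as.data.frame(body)
names(df)  # disp never makes it into the data frame
```

In other words, the recipe never even gets a chance to impute, because there is no disp column (as opposed to a disp column containing NA).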
Thanks for reviewing this code and the recommendation with vetiver, @julia. I appreciate the prompt reply.
I tried passing "NA", and unfortunately I still receive an error. I also receive an error when I pass NA only to the API. Both error screenshots are captured below.
Did you by chance alter the yaml and/or alter the data types? It seems odd to me that I wouldn't see a successful call in the same way that you did in the screenshot you submitted.
In the process of creating this reprex and tinkering with different options, I think I may have found a solution (h/t Tom Mock & his great post about the value of a reprex). I'll post my proposed solution in a few.
Here's the alternate solution I described above. I have also added screenshots below to highlight the result when I pass the API 3 empty strings using this updated code. The print() statements confirm that the added recipe() logic is functioning as intended.
# Load Libraries ----------------------------------------------------------
library(plumber)
library(tidymodels)
library(tidyverse)
# Construct Basic Model ---------------------------------------------------
# Load and split data
df = mtcars
train_df = df[1:25, ]
test_df = df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA
# Define Recipe
mod_rec = recipe(mpg ~ cyl + disp + hp, data = train_df) %>%
  step_impute_median(all_numeric_predictors())
prep(mod_rec, verbose = TRUE)
#> oper 1 step impute median [training]
#> The retained training set is ~ 0 Mb in memory.
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 3
#>
#> ── Training information
#> Training data contained 25 data points and 5 incomplete rows.
#>
#> ── Operations
#> • Median imputation for: cyl, disp, hp | Trained
# Define Model
tree_mod = decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")
# Define Workflow
tree_wkflow = workflow() %>%
  add_recipe(mod_rec) %>%
  add_model(tree_mod)
# Fit Model
mod1 = fit(tree_wkflow, train_df)
saveRDS(mod1, file = "cars.rds")
# API ---------------------------------------------------------------------
trained_mod = readRDS("cars.rds")
#* How many mpg should we expect?
#* @post /predict_mpg
function(req, res) {
  # Convert the parsed body to a data frame and treat empty strings as missing
  df = as.data.frame(req$body)
  df[df == ""] = NA
  print(df)
  # Coerce each column to numeric so the recipe sees the types it was trained on
  for (col in colnames(df)) {
    df[[col]] = as.numeric(df[[col]])
  }
  # Apply the trained recipe so missing values are imputed before predicting
  my_rec = extract_recipe(trained_mod)
  ready_for_predict_df = bake(my_rec, new_data = df)
  print(ready_for_predict_df)
  print("***** END TEST *****")
  predict(trained_mod, new_data = ready_for_predict_df)
}
# Update UI
#* @plumber
function(pr) {
  pr %>% pr_set_api_spec(yaml::read_yaml("cars_yml.yml"))
}
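To sanity-check the cleaning logic without standing the API up, the same steps can be run on a hand-built list standing in for req$body (plumber's parser is what actually produces that list when the API is running):

```r
# Stand-in for the parsed request body when the caller sends three empty strings.
body <- list(cyl = "", disp = "", hp = "")
df <- as.data.frame(body)
df[df == ""] <- NA                    # blank strings become missing values
for (col in colnames(df)) {
  df[[col]] <- as.numeric(df[[col]])  # character NA -> numeric NA
}
df  # one row of numeric NAs, ready for bake() to fill with the training medians
```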
I did use vetiver for the screenshot I showed you earlier. You may be interested in checking it out, because it handles a lot of that checking automatically, without your needing to call bake() explicitly. In most cases you don't want to bake() before predicting with a workflow, since you can end up "double preprocessing" your data:
library(tidymodels)
df <- mtcars
train_df <- df[1:25, ]
test_df <- df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA
mod_rec <- recipe(mpg ~ cyl + disp + hp, data = train_df) |>
  step_impute_median(all_numeric_predictors())
tree_spec <- decision_tree(mode = "regression")
tree_wkflow <- workflow(mod_rec, tree_spec)
mod1 <- fit(tree_wkflow, train_df)
## can predict on the original model
predict(mod1, tibble(cyl = 6, disp = 175, hp = NA))
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 16.6
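Note that the recipe imputes missing values but does not create missing columns; if a predictor column is absent from new_data entirely, predict() errors rather than imputing it (continuing with the mod1 fit above):

```r
# step_impute_median() handles an explicit NA, but the workflow still
# requires every predictor column to be present in new_data.
predict(mod1, tibble(cyl = 6, disp = NA, hp = 110))  # NA gets imputed
try(predict(mod1, tibble(cyl = 6, hp = 110)))        # error: disp is missing
```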
library(vetiver)
#>
#> Attaching package: 'vetiver'
#> The following object is masked from 'package:tune':
#>
#> load_pkgs
v <- vetiver_model(mod1, "cars-rpart")
## can predict on the vetiver model
predict(v, tibble(cyl = 6, disp = 175, hp = NA))
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 16.6
library(plumber)
pr() |> vetiver_api(v)
#> # Plumber router with 3 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │ │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver
#> ├──/metadata (GET)
#> ├──/ping (GET)
#> └──/predict (POST)
## next pipe to `pr_run()` for local API
If you want to know what vetiver is doing under the hood to convert/coerce the new data, you can look here at the handler_predict() function, and especially notice the vetiver_type_convert() function. That is an exported function, so you could use it instead of bake() if it is important to your use case to code the API from scratch rather than use vetiver; it will behave appropriately in more situations than bake().
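To make the "double preprocessing" point concrete, here is a small sketch of my own using a scaling step, where applying the recipe twice visibly changes the result (median imputation just happens to be a no-op the second time around):

```r
library(tidymodels)

# predict() on a workflow re-applies its recipe, so data that was already
# bake()d gets preprocessed twice. With a normalization step the second
# pass rescales already-scaled predictors and the predictions come out wrong.
df  <- mtcars
rec <- recipe(mpg ~ hp + disp, data = df) |> step_normalize(all_predictors())
wf  <- fit(workflow(rec, linear_reg()), df)

baked <- bake(extract_recipe(wf), new_data = df)  # hp, disp already normalized
predict(wf, df)$.pred[1]     # correct: raw data, recipe applied once
predict(wf, baked)$.pred[1]  # wrong: normalized data normalized a second time
```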
Thanks for the additional updates here, @julia. I wasn't exactly sure how you were able to generate a prediction using NA based on the reprex, but knowing that you used vetiver makes much more sense now. Sorry for missing that in your initial reply. I'll share this feedback with our data science team, and thanks again for your help.