Hi, I'm wondering what the suggested course of action is when a recipe that uses step_percentile() encounters new data values outside the range it was prepped on.
Example:
library(dplyr)
library(recipes)
train_df <- tibble(
  a = 1:10,
  b = 10:1
)
rec <-
  train_df %>%
  recipe(a ~ b) %>%
  step_percentile(
    b,
    options = list(
      probs = seq(0, 1, by = 1/4)
    )
  ) %>%
  prep()
new_df <- tibble(a = c(1, 4, 5), b = c(0.99, 5, 10.01))
bake(rec, new_data = new_df)
#> # A tibble: 3 x 2
#>        b     a
#>    <dbl> <dbl>
#> 1 NA         1
#> 2  0.444     4
#> 3 NA         5
I understand why it returns NA, but I could see it being desirable to have values outside the range of the training data mapped to the lowest/highest percentile (0 or 1) instead. Since that isn't an option, would it be best to simply add a recipe step before step_percentile() that caps the data to a pre-determined range?
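For concreteness, here is a minimal sketch of the capping idea I have in mind. It is only my assumption of how this could be done, not an established recommendation: the bounds `lo` and `hi` are names I made up, they are computed from the training data outside the recipe, and they are spliced into step_mutate() with `!!` so that the fixed values are stored in the recipe. I've also stored the columns as doubles here so the training and new data have the same type.

```r
library(dplyr)
library(recipes)

# Same training data as above, stored as doubles
train_df <- tibble(
  a = as.numeric(1:10),
  b = as.numeric(10:1)
)

# Hypothetical capping bounds taken from the training range of b
lo <- min(train_df$b)
hi <- max(train_df$b)

rec_capped <-
  recipe(a ~ b, data = train_df) %>%
  # Clamp b to [lo, hi]; `!!` embeds the values computed above
  step_mutate(b = pmin(pmax(b, !!lo), !!hi)) %>%
  step_percentile(
    b,
    options = list(
      probs = seq(0, 1, by = 1/4)
    )
  ) %>%
  prep()

new_df <- tibble(a = c(1, 4, 5), b = c(0.99, 5, 10.01))
baked <- bake(rec_capped, new_data = new_df)
baked
```

With this, 0.99 and 10.01 should be clamped to 1 and 10 before the percentile step, so they come out as 0 and 1 rather than NA. The downside is that the bounds have to be computed by hand and kept in sync with whatever data the recipe is prepped on.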