Thanks @mattwarkentin. The topic of data leakage is new to me, and I wasn't aware of learnable parameters as something to consider in feature engineering. To make this discussion more concrete, I'd like to use an example. Below I provide some code for preprocessing with recipes and then the dplyr et al. equivalent. I would be grateful if you (or any other member here) could point out, first, what data leakage would look like in the following code and, second, which learnable parameters recipes handles here that dplyr does not.
The following code is adapted from Hansjörg Plieninger's blog post, where he gives a tidymodels walkthrough.
We use the diamonds data from ggplot2.
First, let's show how we would build a recipes specification.
library(rsample)
library(recipes)
library(ggplot2) # for diamonds data
set.seed(123)
# step 1: split data to training and testing
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)
d_training <- training(init_split)
# specify recipe
diamonds_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_log(price) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_dummy(cut) %>%
  step_poly(carat, degree = 2) %>%
  prep()
# save the wrangled training data (that was wrangled according to recipe) to object
d_training_preprocessed_by_recipe <-
diamonds_recipe %>%
bake(new_data = NULL)
d_training_preprocessed_by_recipe
#> # A tibble: 5,394 x 7
#> price cut_1 cut_2 cut_3 cut_4 carat_poly_1 carat_poly_2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6.22 0.632 0.535 3.16e- 1 0.120 -0.0137 0.0110
#> 2 9.57 0 -0.535 -4.10e-16 0.717 0.0198 -0.00484
#> 3 6.88 0.316 -0.267 -6.32e- 1 -0.478 -0.00750 -0.000876
#> 4 8.80 0.632 0.535 3.16e- 1 0.120 0.00599 -0.0127
#> 5 7.27 0.632 0.535 3.16e- 1 0.120 -0.00778 -0.000425
#> 6 9.63 0.316 -0.267 -6.32e- 1 -0.478 0.0389 0.0393
#> 7 8.29 0.316 -0.267 -6.32e- 1 -0.478 0.00571 -0.0126
#> 8 8.57 0 -0.535 -4.10e-16 0.717 0.0113 -0.0120
#> 9 7.41 0.632 0.535 3.16e- 1 0.120 -0.00525 -0.00418
#> 10 8.07 0.632 0.535 3.16e- 1 0.120 0.00599 -0.0127
#> # ... with 5,384 more rows
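For completeness, here is a minimal sketch of how I understand the held-out rows would be handled on the recipes path. It is a cut-down recipe (only step_normalize() on carat) rather than the full one above, and the names rec and d_testing_preprocessed are mine, just for illustration. As far as I can tell, prep() estimates the centring and scaling constants from the training rows, tidy() shows them, and bake() reuses them on new data.

```r
library(rsample)
library(recipes)
library(ggplot2) # for diamonds data

set.seed(123)
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)

# cut-down recipe (illustration only): normalize carat using training rows
rec <-
  recipe(price ~ ., data = training(init_split)) %>%
  step_normalize(carat) %>%
  prep()

# the mean and sd that prep() estimated from the training rows
tidy(rec, number = 1)

# apply those same training-derived constants to the held-out rows
d_testing_preprocessed <- bake(rec, new_data = testing(init_split))
```

If that reading is right, the testing rows never influence the constants used to transform them.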
Now let's assume that we want to take the dplyr path instead. This means that we will not use recipes at all. We still split the data into training and testing sets, but only after wrangling it.
library(dplyr)
library(tibble)
# equivalent to `step_dummy()`
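mutate_dummy_contr.poly <- function(.dat, colname) {
  colname <- deparse(substitute(colname))
  col_as_vec <- .dat[[colname]]
  stopifnot(is.factor(col_as_vec))
  factor_levels <- levels(col_as_vec)
  # contr.poly() returns one column fewer than the number of levels
  n_contrasts <- length(factor_levels) - 1
  contr.poly(factor_levels) %>%
    as_tibble() %>%
    setNames(paste(colname, as.character(seq_len(n_contrasts)), sep = "_")) %>%
    add_column("{colname}" := factor_levels, .before = 0) %>%
    left_join(.dat, ., by = colname) %>%
    # all_of() avoids the ambiguous use of a character variable inside select()
    select(-all_of(colname))
}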
# equivalent to `step_poly()`
mutate_poly_coefs <- function(.dat, colname, deg) {
  colname <- deparse(substitute(colname))
  poly(x = .dat[[colname]], degree = deg) %>%
    as_tibble() %>%
    setNames(paste(as.character(colname), as.character(1:deg), sep = "_")) %>%
    bind_cols(.dat, .)
}
my_diamonds_preproc_with_dplyr <-
  my_diamonds %>%
  mutate(across(price, log)) %>%
  mutate(across(carat, ~as.numeric(scale(.)))) %>%
  mutate_dummy_contr.poly(cut) %>%
  mutate_poly_coefs(carat, deg = 2)
init_split_dplyr <- initial_split(my_diamonds_preproc_with_dplyr, prop = .1)
d_training_dplyr <- training(init_split_dplyr)
d_testing_dplyr <- testing(init_split_dplyr)
In summary, the second approach first wrangles the entire my_diamonds data using dplyr and then splits it into testing and training sets. What data leakage could possibly happen here, and which learnable parameters am I failing to learn properly this way?
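To narrow the question down, here is a minimal sketch of the one place where I suspect the two paths differ: the mean and standard deviation used to scale carat. This is my guess at the issue, not something I have confirmed. The split is done with sample() rather than rsample to keep the example short, and all variable names are made up for illustration.

```r
library(ggplot2) # for diamonds data

set.seed(123)
carat <- diamonds$carat
train_idx <- sample(seq_along(carat), size = floor(0.1 * length(carat)))

# dplyr-style path: mean and sd are estimated from ALL rows,
# so the held-out rows contribute to the scaling constants
full_mean <- mean(carat)
full_sd <- sd(carat)
scaled_full <- (carat - full_mean) / full_sd # same as as.numeric(scale(carat))

# recipes-style path (as I understand it): estimate from training rows only,
# then reuse those constants on every row, including the held-out ones
train_mean <- mean(carat[train_idx])
train_sd <- sd(carat[train_idx])
scaled_train_only <- (carat - train_mean) / train_sd

# compare the estimated constants
c(full_mean = full_mean, train_mean = train_mean)
c(full_sd = full_sd, train_sd = train_sd)

# how much the scaled held-out values differ between the two paths
summary(scaled_full[-train_idx] - scaled_train_only[-train_idx])
```

If this is what is meant by learnable parameters, then I assume the same applies to poly(), whose orthogonal basis is also computed from whatever rows it is given.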
Otherwise, I prefer writing this wrangling/feature engineering code explicitly, so I can clearly see what operations are carried out on the data, instead of using "black box" wrappers such as step_*() from recipes.
I would be glad if anyone could chime in and add their two cents. I could not find any in-depth discussion of this topic elsewhere.
Thanks!