tl;dr
I essentially just want to create a tibble of the original/derived variable pairings from a prepped recipe.
For recipe steps that create one or more new features (thinking specifically of step_dummy() in my case), what is the best way of identifying the original variable(s) from which each derived variable was created?
For example:
library(tidymodels)
library(tidyverse)
rec <-
  recipe(
    Sepal.Length ~ Species + Sepal.Width,
    data = iris
  ) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

prepped <- prep(rec)
The last_term_info list item of the prepped recipe seems to be pretty close. One way of doing this would be to start at each row where source == "derived" and iterate up the data frame until reaching a row where source == "original".
prepped$last_term_info
# A tibble: 5 x 6
# Groups:   variable [5]
  variable           type    role      source   number skip
  <chr>              <chr>   <list>    <chr>     <dbl> <lgl>
1 Sepal.Length       numeric <chr [1]> original      2 FALSE
2 Sepal.Width        numeric <chr [1]> original      2 FALSE
3 Species            nominal <chr [1]> original      1 FALSE
4 Species_versicolor numeric <chr [1]> derived       2 FALSE
5 Species_virginica  numeric <chr [1]> derived       2 FALSE
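To make the walk-upwards idea concrete, here is a rough sketch of it. It assumes that every derived row sits below the original row it came from, which is exactly the row-order assumption in question. The term_info tibble is typed in by hand from the variable and source columns printed above so the snippet runs on its own, and the names term_info and pairings are just placeholders.

```r
library(dplyr)
library(tidyr)

# Hand-copied from the variable and source columns of prepped$last_term_info,
# so the sketch runs without prepping the recipe.
term_info <- tibble(
  variable = c("Sepal.Length", "Sepal.Width", "Species",
               "Species_versicolor", "Species_virginica"),
  source   = c("original", "original", "original", "derived", "derived")
)

pairings <- term_info %>%
  # Keep the name only on "original" rows, NA elsewhere.
  mutate(original_variable = if_else(source == "original", variable, NA_character_)) %>%
  # Carry the most recent original name down to the derived rows below it.
  fill(original_variable, .direction = "down") %>%
  filter(source == "derived") %>%
  select(derived_variable = variable, original_variable)

pairings
```

On the real last_term_info, which prints as grouped by variable, an ungroup() would presumably be needed first, since fill() works within groups.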
I'm worried about the idea above because I don't like relying on the row order, it feels kind of hacky, and it would not work at all in the case of something like step_pca().
I could also see doing string manipulation, removing the suffix after Species, but I feel like there are a lot of ways that could go wrong if there are other similarly named variables in the recipe. Can anyone think of a better way of doing this?
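For reference, a minimal sketch of the string-based idea. It assumes the default step_dummy() naming, where the dummy column is the original name, an underscore, then the level. The helper match_original() is made up for illustration, and the name vectors are typed in by hand rather than pulled from the prepped recipe.

```r
library(tibble)

# Names copied by hand from prepped$last_term_info so the sketch is self-contained.
original <- c("Sepal.Length", "Sepal.Width", "Species")
derived  <- c("Species_versicolor", "Species_virginica")

# Hypothetical helper: for one derived name, return the longest original name
# that, followed by "_", is a prefix of the derived name; NA if none matches.
match_original <- function(derived_name, originals) {
  hits <- originals[startsWith(derived_name, paste0(originals, "_"))]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

pairings <- tibble(
  derived_variable  = derived,
  original_variable = vapply(derived, match_original, character(1),
                             originals = original, USE.NAMES = FALSE)
)

pairings
```

Taking the longest match helps when one original name is a prefix of another, but it can still misassign a column, for example if the recipe had both Species and Species_group and Species had a level called group_a.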
I am imagining the output looking something like this:
# A tibble: 2 x 2
  derived_variable   original_variable
  <chr>              <chr>
1 Species_versicolor Species
2 Species_virginica  Species
Thanks!