Recipe Sequence

I'm using recipes with tidymodels and can't get the preprocessing steps to reliably output data as expected. As you can see from the reprex, I believe bake() with new_data = NULL should return the processed training set when the recipe is prepped with retain = TRUE. However, the baked data is not centered or scaled, the correlation threshold does not reduce the predictors, the PCA components do not appear, and so on. My objective is to verify that each step produces the right output before training. Any ideas what I'm doing wrong?

library(tidymodels)
library(textrecipes)
library(themis)
library(embed)
library(glue)  # glue() builds the formula below; tidymodels does not attach it

set.seed(123)  # make the sampled reprex data reproducible

t <- "Target"  # name of the outcome column

Target <- as.factor(sample(c("A", "B"), 100, replace = TRUE))
Other <- as.factor(sample(c("AA", "BB", "CCC", "DDD"), 100, replace = TRUE))
Numb1 <- sample(1:100, 100, replace = TRUE)
Numb2 <- sample(1:100, 100, replace = TRUE)
df <- tibble(Target, Other, Numb1, Numb2)

rec <- recipe(as.formula(glue("{t} ~ .")), data = df) %>%
  step_zv(all_predictors(), skip = FALSE) %>%
  step_impute_knn(all_predictors(), skip = FALSE) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE, skip = FALSE) %>%
  step_clean_names(all_predictors(), skip = FALSE) %>%
  step_center(all_numeric_predictors(), skip = FALSE) %>%
  step_scale(all_numeric_predictors(), skip = FALSE) %>%
  step_corr(all_numeric_predictors(), threshold = 0, method = "pearson", skip = FALSE) %>%
  # all_of() selects the column whose name is stored in t
  step_upsample(all_of(t), skip = TRUE) %>%
  step_pca_truncated(all_numeric_predictors(), num_comp = 5, skip = FALSE)

# Use a name other than prep so the prep() function is not shadowed
prepped <- prep(rec, training = df, retain = TRUE)
bake(prepped, new_data = NULL)

A good article on sequencing steps is "Ordering of steps" in the recipes documentation, which I checked this flow against.

OK, after working with this for the last couple of days, what I've learned is that sequence REALLY matters. I resolved this example by removing steps and changing their order, and I decided to add a separate processing step, with more data cleaning, before this recipe. That makes it easier to QA what is happening with real use cases.
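One way to do this kind of QA is to prep and bake the recipe one stage at a time and inspect each result before adding the next step. The following is only a minimal sketch, assuming simulated data like the reprex above; `check_bake()` is a hypothetical helper written for this illustration, not part of recipes, and the two stages shown are not the full original recipe.

```r
library(tidymodels)

set.seed(123)
df <- tibble(
  Target = factor(sample(c("A", "B"), 100, replace = TRUE)),
  Other  = factor(sample(c("AA", "BB", "CCC", "DDD"), 100, replace = TRUE)),
  Numb1  = sample(1:100, 100, replace = TRUE),
  Numb2  = sample(1:100, 100, replace = TRUE)
)

# Hypothetical helper: prep a recipe on df and return the processed training set
check_bake <- function(rec) {
  bake(prep(rec, training = df, retain = TRUE), new_data = NULL)
}

base_rec <- recipe(Target ~ ., data = df)

# Stage 1: center and scale the original numeric predictors only
stage1 <- base_rec %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
out1 <- check_bake(stage1)
mean(out1$Numb1)  # should be approximately 0
sd(out1$Numb1)    # should be approximately 1

# Stage 2: add one-hot dummy variables and check the new column names
stage2 <- stage1 %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
out2 <- check_bake(stage2)
names(out2)

# tidy() on a prepped recipe lists each step and whether it was trained
tidy(prep(stage2, training = df))
```

Checking each stage separately makes it easier to see which step changes the data in an unexpected way once later steps, such as correlation filtering or PCA, are added back in.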
