Transforming the test set to the trained principal component space with tidymodels

zed · June 29, 2021, 9:47am

I followed this tutorial to do PCA preprocessing on my training data. Here is what I have done so far:

# load libraries
library(tidymodels)
library(tidyverse)

# build PCA model
pca_recipe = recipe(~., data = training_data) %>%
        update_role(232:236, new_role = "id") %>% # these are the outcomes (y labels) that I will use to train a ML model
        step_normalize(all_predictors()) %>% 
        step_pca(all_predictors(), num_comp = 10)

pca_prep = prep(pca_recipe)

pca_tidy = tidy(pca_prep, 2)

Now, in the language of linear algebra: I want to project my testing data into the same space using these trained PCs. I do not want to run another PCA on the testing data, as this is not recommended. I specifically want my testing data to go through all the preprocessing steps my training data went through. For example, I do not want step_normalize(all_predictors()) to normalize my testing data using the testing data's means and SDs, but using the training data's means and SDs.

I thought predict would do it like so:

predict(pca_prep, testing_data)

But it gives me an error saying:

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "recipe"

How can I achieve this?

Max · June 29, 2021, 4:26pm

Recipes never re-estimate on new data; they are constrained to use only the estimates/statistics/values from the training set.

If you are going to use the recipe with a model, we strongly advise using a workflow. See this book section. predict() will work in that case.
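
As a rough sketch of the workflow route (not the poster's actual setup): the built-in mtcars data stands in for the real training and testing sets, mpg is an assumed outcome, and a plain linear regression is an assumed placeholder model. The recipe is estimated on the training rows only when fit() is called, and predict() then applies those same training estimates to the new rows.

```r
library(tidymodels)

# Hypothetical stand-in data: split mtcars into training and testing rows
training_data <- mtcars[1:24, ]
testing_data  <- mtcars[25:32, ]

# Recipe with an assumed outcome (mpg); all other columns are predictors
pca_recipe <- recipe(mpg ~ ., data = training_data) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 3)

# Bundle the recipe with a placeholder model
wf <- workflow() %>%
  add_recipe(pca_recipe) %>%
  add_model(linear_reg())

# fit() estimates the normalization and PCA loadings on training_data only
wf_fit <- fit(wf, data = training_data)

# predict() preprocesses testing_data with the training estimates, then predicts
preds <- predict(wf_fit, new_data = testing_data)
```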

If you are going to use the recipe on its own, then see the documentation for prep() and bake() (bake() is analogous to predict()).
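
A minimal sketch of the prep()/bake() pattern, assuming the built-in mtcars data as a stand-in for the real training and testing sets and three components instead of ten. prep() estimates the means, SDs, and PCA loadings from the training rows; bake() applies them to whatever data it is given without re-estimating anything.

```r
library(recipes)

# Hypothetical stand-in data: split mtcars into training and testing rows
training_data <- mtcars[1:24, ]
testing_data  <- mtcars[25:32, ]

pca_recipe <- recipe(~ ., data = training_data) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 3)

# Estimate means, SDs, and loadings from the training data only
pca_prep <- prep(pca_recipe)

# new_data = NULL returns the preprocessed training set
train_pcs <- bake(pca_prep, new_data = NULL)

# The testing set is normalized and rotated with the training estimates
test_pcs <- bake(pca_prep, new_data = testing_data)
```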

It is advisable to read more about recipes. The package is very powerful and, with great power, comes a higher likelihood of getting tripped up.

zed · June 30, 2021, 8:05am

I will actually use the result of the PCA as input to an artificial neural network model built with keras. But I think keras and recipes will not work together, right? In that case, is it more advisable to preprocess my data with recipes and save it, so I can later use it as input to my neural network? Or could keras and workflows somehow go together?

Max · June 30, 2021, 2:22pm

If you are putting this into a keras model, then define the recipe, estimate it using prep(), and generate the preprocessed data (for all data sets) using bake(). Note that bake() has a composition argument, so you can have the data returned already formatted as an R matrix.
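
A sketch of what that could look like, again assuming mtcars as stand-in data, with mpg as a hypothetical outcome column kept out of the PCA via the "id" role as in the original post. Passing all_predictors() to bake() keeps only the principal components, and composition = "matrix" returns a numeric matrix that can be handed to keras.

```r
library(recipes)

# Hypothetical stand-in data: split mtcars into training and testing rows
training_data <- mtcars[1:24, ]
testing_data  <- mtcars[25:32, ]

pca_recipe <- recipe(~ ., data = training_data) %>%
  update_role(mpg, new_role = "id") %>%   # assumed outcome, excluded from the PCA
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 3)

pca_prep <- prep(pca_recipe)

# Keep only the predictor columns (the PCs) and return them as matrices
x_train <- bake(pca_prep, new_data = NULL, all_predictors(),
                composition = "matrix")
x_test  <- bake(pca_prep, new_data = testing_data, all_predictors(),
                composition = "matrix")

# The outcome is taken from the raw data, since it was not transformed
y_train <- training_data$mpg
y_test  <- testing_data$mpg
```

The matrices can also be written to disk (for example with saveRDS()) if the network is trained in a separate session.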
