I followed this tutorial to do PCA preprocessing on my training data. Here is what I have done so far:
# load libraries
library(tidymodels)
library(tidyverse)
# build PCA model
pca_recipe = recipe(~., data = training_data) %>%
  update_role(232:236, new_role = "id") %>% # these are the outcomes (y labels) that I will use to train a ML model
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 10)
pca_prep = prep(pca_recipe)
pca_tidy = tidy(pca_prep, 2)
Now, in the language of linear algebra, I want to project my testing data into the same space using these trained PCs. I do not want to run another PCA on the testing data, as this is not recommended. Specifically, I want my testing data to go through exactly the same preprocessing steps my training data went through. For example, I do not want step_normalize(all_predictors()) to normalize my testing data using the testing data's means and SDs, but using the training data's means and SDs.
I thought predict would do it like so:
predict(pca_prep, testing_data)
But it gives me an error saying:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "recipe"
I will actually use the PCA result as input to an artificial neural network model built with keras. But I think keras and recipes will not work together, right? In that case, is it more advisable to preprocess my data with recipes and then save it, so I can later use it as input to my neural networks? Or could keras and workflows somehow go together?
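If you are putting this into a keras model, then define the recipe, estimate it using prep(), and generate the preprocessed data (for all data sets) using bake(). A recipe has no predict() method; bake(pca_prep, new_data = testing_data) applies the normalization statistics and PCA loadings that prep() estimated from the training data, so the testing data is never used to estimate anything. Note that bake() has a composition argument, so composition = "matrix" returns the data already formatted as an R matrix.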
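As a minimal sketch of that prep()/bake() flow: the split of mtcars into training_data and testing_data, the use of mpg as the single outcome column, and num_comp = 3 are all stand-in assumptions chosen so the example runs on its own. Substitute your own data, outcome columns, and number of components.

```r
library(recipes)

# Hypothetical stand-ins for the real training/testing sets (assumption)
set.seed(1)
idx <- sample(nrow(mtcars), 24)
training_data <- mtcars[idx, ]
testing_data  <- mtcars[-idx, ]

# Same recipe structure as in the question; mpg plays the role of the outcomes
pca_recipe <- recipe(~ ., data = training_data) %>%
  update_role(mpg, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 3)

# Estimate means, SDs, and PCA loadings from the training data only
pca_prep <- prep(pca_recipe, training = training_data)

# new_data = NULL returns the preprocessed training set retained by prep()
train_x <- bake(pca_prep, new_data = NULL, all_predictors(),
                composition = "matrix")

# The testing set is transformed with the training-set estimates
test_x <- bake(pca_prep, new_data = testing_data, all_predictors(),
               composition = "matrix")

# Outcome column(s), selected by name, also as a numeric matrix
train_y <- bake(pca_prep, new_data = NULL, mpg, composition = "matrix")
test_y  <- bake(pca_prep, new_data = testing_data, mpg, composition = "matrix")
```

The resulting matrices can be passed to a keras model as inputs and targets. Selecting all_predictors() inside bake() keeps only the principal component columns, which matters because composition = "matrix" requires every returned column to be numeric.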