Scaling of unseen data in Principal Component Analysis (PCA)


In PCA, we scale the training and test data together. To scale the unseen/prediction dataset, do we need to use the same scale (i.e. the same mean and standard deviation) as was used for the training and test datasets, or can we scale the unseen/prediction data on its own scale?


You have to scale using the training set only. The test set and any other future data should never be included in the construction of the scaling. Including them leads to data snooping and potentially overfitting.
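As a rough illustration of the idea (not how any particular package implements it), here is a minimal Python sketch using only the standard library. The helper names `fit_scaler` and `apply_scaler` and the toy numbers are made up for this example. The mean and standard deviation are estimated from the training rows only and then reused, unchanged, for the test rows.

```python
from statistics import mean, stdev

def fit_scaler(train_rows):
    """Estimate per-column mean and sample standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]

def apply_scaler(rows, means, sds):
    """Center and scale rows with parameters that were estimated elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Toy data (hypothetical): two predictors, three training rows, one test row.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

means, sds = fit_scaler(train)                  # parameters come from training data only
train_scaled = apply_scaler(train, means, sds)
test_scaled = apply_scaler(test, means, sds)    # test data reuses the training parameters
```

The same stored `means` and `sds` would be applied to any later prediction data before projecting it onto the principal components.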



For scaling the test data and future data, do we need to use the training data parameters (i.e. the mean and standard deviation of the training data)?

To reinforce this, the examples I use when teaching are cases where the data that you are predicting:

  • is a single sample (i.e. n = 1), so that its own standard deviation cannot be computed and scaling cannot be done, or

  • comes from a remote part of the predictor space. In this case, its means would not be similar to those of the data used to conduct the original PCA.
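Both cases can be sketched with toy numbers. This is only an illustration in plain Python; the values and variable names are invented for the example. It shows that a single sample has no standard deviation of its own, and that a batch from a remote region, when scaled on its own, becomes indistinguishable from the training data.

```python
from statistics import StatisticsError, mean, stdev

# Hypothetical one-predictor training data and its scaling parameters.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
train_mean, train_sd = mean(train), stdev(train)
train_scaled = [(x - train_mean) / train_sd for x in train]

# Case 1: a single new sample has no standard deviation of its own.
single = [10.0]
try:
    stdev(single)
    own_scale_possible = True
except StatisticsError:
    own_scale_possible = False
# Scaling with the training parameters still works for one sample.
single_scaled = [(x - train_mean) / train_sd for x in single]

# Case 2: a batch from a remote part of the predictor space.
remote = [101.0, 102.0, 103.0, 104.0, 105.0]
# Scaled on its own, the batch looks just like the training data: the shift is erased.
remote_own = [(x - mean(remote)) / stdev(remote) for x in remote]
# Scaled with the training parameters, the batch stays far from the training data.
remote_train = [(x - train_mean) / train_sd for x in remote]
```

Here `remote_own` matches `train_scaled`, so a PCA fitted on the training data would place these remote samples in the middle of the training cloud, while `remote_train` keeps them far away, as it should.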

The recipes and caret packages do pre-processing this way (and repeat it inside each resample, which is also important).
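To make the resampling point concrete, here is a small Python sketch of the general idea. It is not the internals of recipes or caret; the helper `k_fold_indices`, the fold layout, and the data are assumptions for illustration. Within each fold, the scaling parameters are re-estimated from the rows used for fitting and then applied to the held-out rows.

```python
from statistics import mean, stdev

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved held-out folds (illustrative helper)."""
    return [list(range(i, n, k)) for i in range(k)]

# Hypothetical single-predictor data.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

fold_params = []
fold_scaled = []
for held_out in k_fold_indices(len(values), k=3):
    analysis = [v for i, v in enumerate(values) if i not in held_out]
    assessment = [values[i] for i in held_out]
    # Parameters are re-estimated from this fold's analysis rows only.
    m, s = mean(analysis), stdev(analysis)
    # Held-out rows are scaled with those parameters, never with their own.
    fold_scaled.append([(v - m) / s for v in assessment])
    fold_params.append((m, s))
```

Each fold ends up with slightly different parameters, so the held-out rows never influence the scaling (or, in a full workflow, the PCA) that is then evaluated on them.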
