As a bit of a follow-up to my previous question, I've seen disagreement online about whether I should center/scale my dummy variables prior to modeling (see the reprexes in the link above for an example). Andrew Gelman seems to say that I shouldn't, but Rob Tibshirani seems to say that I should.
Does anyone have experience with this? Would the answer differ depending on whether I was using glmnet/LASSO versus keras/a neural network?
(One of my favorite things about tree-based models like xgboost is that I don't have to think about these issues as much.)
Yes, when the model requires the predictors to be on the same scale:

- Regularized models (glmnet and the like) put penalties on the sum of the absolute (or squared) slope values, so coefficients on differently scaled predictors get penalized unevenly.
- Nearest-neighbor models use distance values, and kernel methods (e.g., SVMs) use dot products; there is a quick numeric illustration of this just below the list.
- Neural networks usually initialize the weights with random numbers and assume the predictors are on a common scale.
- PLS models chase covariance and assume that the variances are the same.
- And so on.
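To make the distance/dot-product point concrete, here is a toy example in base R (the numbers are made up for illustration, not from the thread):

```r
# A predictor measured on a large scale dominates Euclidean distance,
# which is what nearest-neighbor and kernel methods consume.
x <- rbind(a = c(income = 50000, age = 30),
           b = c(income = 51000, age = 60))
dist(x)         # ~1000: driven almost entirely by income
dist(scale(x))  # 2: after centering/scaling, both predictors contribute equally
```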
There is a decent argument for scaling them all to a variance of two but, regardless, for some models you will harm performance if you do not normalize the predictors as needed. A minimal preprocessing sketch is below.
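If it helps, here is one way to set this up in R, assuming the recipes and glmnet packages and using mtcars purely as a stand-in dataset (none of these names come from the original thread):

```r
library(recipes)
library(glmnet)

# Stand-in data: treat cyl as categorical so it gets dummy-encoded.
dat <- mtcars
dat$cyl <- factor(dat$cyl)

rec <- recipe(mpg ~ ., data = dat) |>
  step_dummy(all_nominal_predictors()) |>  # make 0/1 indicator columns
  step_normalize(all_predictors())         # center/scale everything, dummies included

prepped <- prep(rec, training = dat)
baked   <- bake(prepped, new_data = NULL)

x <- as.matrix(baked[, setdiff(names(baked), "mpg")])
y <- baked$mpg

# glmnet standardizes internally by default; turn that off here since the
# predictors were already normalized in the recipe.
fit <- cv.glmnet(x, y, alpha = 1, standardize = FALSE)
```

The step_normalize(all_predictors()) line is where the choice you're asking about lives: restrict it to the non-dummy columns if you'd rather leave the indicators at 0/1.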
Agreed! Low maintenance is the way to go initially.