Hello,
I am becoming familiar with different statistical models, starting with linear regression (simple and multiple). I understand that many statistical models are sensitive to the scale of the variables, particularly those that are distance based.
There are different types of scaling (standardization, centering, normalization, etc.). Scaling can certainly help with visualizations when one variable's range is much larger than another's (e.g., in a scatterplot).
My question: does linear regression (simple or multiple) work better if the explanatory variables X and the response variable Y are scaled first? Or is scaling only necessary if the ranges of the X and Y variables are VERY different? And what type of scaling would be most appropriate?
I understand there is no single scaling solution, but I wonder what the best way is to think about and approach scaling...
The types of scaling you mention are linear (technically, affine) transformations of the variables. In a linear regression, this makes no substantive difference: the fitted values, residuals, and R² are unchanged. One does have to be careful about the interpretation of the coefficients, of course. If one divides an independent variable by 2, then the coefficient on that variable will exactly double.
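As a quick illustration of that invariance (a sketch on synthetic data, using plain NumPy rather than R), fitting by ordinary least squares before and after halving a predictor doubles that coefficient while leaving the fitted values identical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

# Fit y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Refit with the predictor divided by 2: the slope exactly doubles,
# and the fitted values are unchanged.
X_half = np.column_stack([np.ones_like(x), x / 2])
b_half = np.linalg.lstsq(X_half, y, rcond=None)[0]

print(np.isclose(b_half[1], 2 * b[1]))      # True: slope doubled
print(np.allclose(X @ b, X_half @ b_half))  # True: same fitted values
```

The same holds for centering or any other invertible affine transformation of the columns: the least-squares fit itself does not move, only the coefficients' interpretation changes.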
Sometimes scaling can improve the numerical properties of the computations behind a linear regression. With modern software such as R, this is very unlikely to be an issue.
It depends on the estimation method. For ordinary least squares, there is no requirement to scale the variables.
For any penalized model (e.g., glmnet, lasso, lars, ridge regression, principal components regression, etc.), you should have the predictors on the same scale.
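To see why the penalty makes scale matter (a sketch on synthetic data, using a hand-rolled closed-form ridge solution without an intercept rather than glmnet), rescaling one predictor changes the ridge fit, unlike OLS:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution (no intercept, for simplicity):
    # b = (X'X + lam*I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)

b = ridge(X, y, lam=10.0)

# Rescale the first column by 100. Unlike OLS, the ridge fitted
# values change, because the penalty shrinks all coefficients
# equally regardless of the predictors' units.
X2 = X.copy()
X2[:, 0] *= 100.0
b2 = ridge(X2, y, lam=10.0)

print(np.allclose(X @ b, X2 @ b2))  # False: the fit depends on scale
```

Putting the predictors on a common scale first makes the penalty treat them symmetrically, which is why the scaling step is part of the standard workflow for these models.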
In tidymodels, the man pages for each model type tell you when you should use such a method:
Preprocessing requirements
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.