I have a statistical question rather than a programming one. I am asking it here because I know the rstanarm crew follows this site, and I hope you will not mind.
Specifically, when using frequentist methods for hierarchical linear models, one is cautioned that non-normality of the dependent variable (more precisely, of the model residuals) can lead to important errors in the estimates of both the coefficients and their confidence intervals. One can largely eliminate problems with non-normality by transforming the dependent variable, for example with a Box-Cox transform. But this comes at the cost of a non-linear model on the untransformed dependent variable, a real loss in ease of explanation and application.
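For concreteness, here is a minimal R sketch of how one might choose a Box-Cox transformation with MASS::boxcox; the data frame `d` and variable names are hypothetical placeholders:

```r
library(MASS)   # boxcox()

## hypothetical data frame `d` with a positive response y and a predictor x
fit <- lm(y ~ x, data = d)

## profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))
lambda_hat <- bc$x[which.max(bc$y)]

## transform the response with the chosen lambda (log if lambda is near 0)
d$y_bc <- if (abs(lambda_hat) < 1e-2) log(d$y) else (d$y^lambda_hat - 1) / lambda_hat
fit_bc <- lm(y_bc ~ x, data = d)
```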
My question is: how important is non-normality of the residuals when one is using Bayesian methods (such as stan_lmer) to fit a hierarchical linear model? I raise this question because, when using MCMC to fit a model, one computes medians and credible intervals from the actual posterior draws, rather than from calculations that depend on an assumption of normality. I suspect that these will be accurate even when the residuals are modestly to moderately non-normal. On the other hand, one still starts from a model that assumes normally distributed residuals.
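To illustrate the point about the intervals coming straight from the draws, a minimal rstanarm sketch (the formula and data frame `d` are placeholders): no normal-theory formula is used for the interval, only quantiles of the posterior sample.

```r
library(rstanarm)

## hypothetical hierarchical model; `d` and the formula are placeholders
fit <- stan_lmer(y ~ x + (1 | group), data = d)

## 95% credible interval taken directly from the posterior draws
posterior_interval(fit, prob = 0.95, pars = "x")

## equivalently, raw quantiles of the draws for the coefficient on x
draws <- as.matrix(fit, pars = "x")
quantile(draws, probs = c(0.025, 0.5, 0.975))
```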
So, Bayesian experts, what's the answer to my question? Is dependent variable normality less important with a Bayesian analysis?
Thanks to the rest of you for putting up with a core statistics theory question.
Larry Hunsicker
I am no Bayesian expert, but if the model assumes Gaussian error around the fit when that is not true, I do not see why the Bayesian calculation would not be affected just as much as the frequentist approach. However, when using MCMC tools like Stan or JAGS, it is easy to build non-Gaussian error into the model. Here is the first hit I got on a web search for "robust Bayesian linear regression". I only mention that it was the first hit because I only skimmed it, but from what I saw it seems like a very good explanation.
I have not used rstanarm, so I cannot say how one would use that to model non-Gaussian error.
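Not an rstanarm answer, but as an illustration of the general idea in that kind of post: one common way to build in non-Gaussian error is to replace the normal likelihood with a Student-t. A minimal sketch using brms (a different package, offered here only as an assumption about what is available to the reader; formula and data are placeholders):

```r
library(brms)

## hypothetical hierarchical model with Student-t errors instead of Gaussian;
## `d`, the formula, and all defaults left here are placeholders
fit_robust <- brm(
  y ~ x + (1 | group),
  data   = d,
  family = student(),   # heavy-tailed likelihood downweights outlying residuals
  chains = 4, cores = 4
)

## the degrees-of-freedom parameter nu is estimated from the data;
## a small nu indicates tails heavier than the normal
summary(fit_robust)
```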
I have refined (and answered, I think) my question. There are two levels in regression at which the assumption of normality is used. The first is in determining the maximum likelihood solution, and the second is in calculating the coverage of the confidence or credible intervals.

Taking the second issue first: with frequentist methods, one calculates the confidence interval from the mean and variance, whereas in a Bayesian analysis one has a sample from the posterior distribution and determines coverage directly from that sample. So Bayesian methods do not depend on the assumption of normality to determine credible intervals.

Back to the first issue. In linear regression, the maximum likelihood solution is the least-squares solution -- the solution that minimizes the variance of the residuals. This depends only on the mean and variance -- the first two moments. Therefore (I think) it requires only homoscedasticity of the residuals, not necessarily the absence of skew or kurtosis. So, at least in linear regression, Bayesian methods that assume normally distributed residuals should be satisfied so long as the residuals can be assumed to be homoscedastic.

Normality implies homoscedasticity, but it is actually a stronger condition, since it also implies the absence of both skew and kurtosis. Much of the time, the "non-normality" of a distribution is due to the presence of skew and/or kurtosis, and this is what Box-Cox can correct. But if my analysis above is correct, at least in linear regression one does not need a Box-Cox transform so long as the residuals are homoscedastic.
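To spell out the step that the maximum likelihood solution under a normal error model is the least-squares solution (a standard derivation, written out here only for reference):

$$
\log L(\beta, \sigma^2) \;=\; -\frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - x_i^{\mathsf T}\beta\bigr)^2,
$$

so for any fixed $\sigma^2$, maximizing the log-likelihood over $\beta$ is the same as minimizing the residual sum of squares $\sum_i (y_i - x_i^{\mathsf T}\beta)^2$.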
Would a real statistician comment on the above?
I certainly do not pass this check, but I still wanted to add a comment noting that the following does seem suspicious:
Why do you say so? Can you please explain your argument?
As far as I understand, the residuals can be assumed to be homoscedastic when the variance of the residuals is not affected by the predictors. What does that have to do with the distribution of the residuals? One can have normally distributed residuals, each with its own variance (think weighted least squares), can't one?
If you (or someone else) thinks that this understanding is wrong, please do point out and help me to rectify myself.
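To make that point concrete, here is a small simulated sketch (all names and numbers are made up): every error is drawn from a normal distribution, yet each observation has its own variance, so the residuals are heteroscedastic and weighted least squares is the natural fit.

```r
set.seed(1)
n    <- 200
x    <- runif(n)
sd_i <- 0.2 + 2 * x                      # error SD grows with the predictor: heteroscedastic
y    <- 1 + 3 * x + rnorm(n, sd = sd_i)  # but every individual error is still normal

fit_ols <- lm(y ~ x)                        # ignores the unequal variances
fit_wls <- lm(y ~ x, weights = 1 / sd_i^2)  # inverse-variance weighted least squares

summary(fit_ols)$coefficients
summary(fit_wls)$coefficients   # typically tighter standard errors with correct weights
```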
Indeed, one can have two or more normal distributions with different variances. But homo- or heteroscedasticity is defined on the distributions, not on the individual errors. (Obviously the individual residuals can vary.) One typically uses variance-weighted regression when one wants to combine data from two or more distributions (as in meta-analysis) where the true means of the different distributions are assumed to be the same -- e.g., several sets of measurements of the same thing made with instruments of varying precision. There, the intent is to get the most accurate measure of the mean. But the variance of the sets combined with variance weighting does not necessarily equal the variance of any of the component distributions; i.e., it is not necessarily an accurate estimate of the variance that one would find using any specific method of unknown precision.
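For reference, a tiny sketch of the inverse-variance weighting alluded to above (the numbers are made up): the pooled variance is smaller than any component variance, which is why it does not estimate the spread of any single instrument.

```r
## hypothetical measurements of the same quantity from instruments of differing precision
means <- c(10.2, 9.8, 10.5)
vars  <- c(0.40, 0.10, 0.90)

w <- 1 / vars
pooled_mean <- sum(w * means) / sum(w)   # inverse-variance weighted mean
pooled_var  <- 1 / sum(w)                # variance of the pooled estimate

c(pooled_mean = pooled_mean, pooled_var = pooled_var)
```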
As used in describing simple linear regression analysis, one assumption of the fitted model (to ensure that the least-squares estimators are each a best linear unbiased estimator of the respective population parameters, by the Gauss-Markov theorem) is that the standard deviations of the error terms are constant and do not depend on the x-value [where x may be a vector of variable values; my interpolation]. Consequently, each probability distribution for y (the response variable) has the same standard deviation regardless of the x-value (predictor). In short, this assumption is homoscedasticity. Homoscedasticity is not required for the estimates to be unbiased, consistent, and asymptotically normal.
In sum, one needs only homoscedasticity, not necessarily full normality, for valid linear regression.
Finally, in response to the question whether normality implies homoscedasticity, another quotation, this time from a Cross Validated answer (https://stats.stackexchange.com/questions/266504/why-is-the-normal-distribution-family-in-glms-homoscedastic): "The normal family of distributions for a regression model is $y_i \sim N(\mu_i = x_i^{\mathsf T}\beta, \sigma^2)$. This means that we assume that our random variable $y_i$ has a normal distribution with some unknown mean $\mu_i = x_i^{\mathsf T}\beta$ and unknown variance $\sigma^2$. This says that, whatever the value of the predictor variables $x_i$, the [expected -- my addition] variance of the observation is the same. That is homoskedasticity!"