RStudio, and R itself, are merely tools in statistical analysis. Before applying those tools to data, it is important to articulate the intended outcome clearly. Otherwise, the situation is like bringing a load of building materials to a construction site to see what can be made of them.
Regression is a broad category of statistical methods designed to assess the association between variables. In the simple case of ordinary least squares regression, as implemented by the lm() function, for example, it estimates E(Y|X), the expected value of Y given X. Here's a trivial example:
fit <- lm(mpg ~ ., mtcars)
summary(fit)
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4506 -1.6044 -0.1196 1.2193 4.6271
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337 18.71788 0.657 0.5181
#> cyl -0.11144 1.04502 -0.107 0.9161
#> disp 0.01334 0.01786 0.747 0.4635
#> hp -0.02148 0.02177 -0.987 0.3350
#> drat 0.78711 1.63537 0.481 0.6353
#> wt -3.71530 1.89441 -1.961 0.0633 .
#> qsec 0.82104 0.73084 1.123 0.2739
#> vs 0.31776 2.10451 0.151 0.8814
#> am 2.52023 2.05665 1.225 0.2340
#> gear 0.65541 1.49326 0.439 0.6652
#> carb -0.19942 0.82875 -0.241 0.8122
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.65 on 21 degrees of freedom
#> Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
#> F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Created on 2023-02-23 with reprex v2.0.2
This models gas mileage, mpg, as a function of all of the other variables in the dataset. Notice that adjusted R^2 appears to indicate a relatively high degree of association, yet none of the variables individually has a p-value below the conventional 0.05 threshold, and only wt (0.063) comes anywhere near it.
However, fitting a model with only one independent variable shows a different result.
fit <- lm(mpg ~ drat, mtcars)
summary(fit)
#>
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.0775 -2.6803 -0.2095 2.2976 9.0225
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7.525 5.477 -1.374 0.18
#> drat 7.678 1.507 5.096 1.78e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
#> F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
This time adjusted R^2 is not nearly as "good", but the p-value of the drat variable has shrunk from 0.64 to a vanishingly small 1.78e-05. What happened?
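For R^2: overfitting. Adding variables never lowers multiple R^2, and adjusted R^2 only partly corrects for that, so a high value by itself says little about any individual variable. For the change in the p-value: the test statistic for a variable measures its contribution after accounting for every other variable included in the model. The predictors in mtcars are strongly correlated with one another, so in the full model little is left for drat alone to explain. When the set of variables changes, the "best fit" changes, and with it the coefficient, its standard error, and its p-value.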
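One way to see how much drat overlaps with the other predictors is to regress it on them. The following is a minimal sketch in base R; the variance inflation factor is computed by hand as 1 / (1 - R^2), and the object names (predictors, aux_fit, vif_drat) are illustrative only.

```r
# Predictors only: drop the response, mpg
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]

# Pairwise correlations between drat and every predictor (including itself)
drat_cor <- cor(predictors)["drat", ]
round(drat_cor, 2)

# Regress drat on all the other predictors; a high R^2 here means
# drat carries little information that the others do not already carry
aux_fit <- lm(drat ~ ., data = predictors)
aux_r2 <- summary(aux_fit)$r.squared

# Variance inflation factor for drat in the full model
vif_drat <- 1 / (1 - aux_r2)
vif_drat
```

A value well above 1 suggests that the variance of the drat coefficient in the full model is inflated by its correlation with the other predictors, which is consistent with the large p-value seen there.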
What to make of this? Application of regression, even in the simple case, is not a mechanical process. Attention must be paid to what the test statistics actually represent.
In addition to that pitfall, not all flavors of regression are equally applicable to all combinations of variables. For example, let Y be a binary outcome: yes/no, TRUE/FALSE, 1/0, lives/dies. (Y is the conventional notation for a dependent variable.) For that case, lm() is not appropriate; glm() is needed.
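As a rough illustration with the same dataset, the am column of mtcars is coded 0/1 (automatic/manual), so it can stand in for a binary Y. This is only a sketch; the choice of wt as the single predictor and the object names are assumptions made for the example.

```r
# Logistic regression: binary outcome am (0 = automatic, 1 = manual)
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)
summary(logit_fit)

# Predicted probability of a manual transmission for a
# hypothetical car with wt = 3 (weight in 1000 lbs)
p_manual <- predict(logit_fit, newdata = data.frame(wt = 3),
                    type = "response")
p_manual
```

Unlike lm(), the fitted values on the response scale stay between 0 and 1, and the coefficients are on the log-odds scale.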
How about categorical variables, such as yours?
They differ from continuous variables, such as mpg, that can in theory take on an infinite range of values. In addition to binary (dichotomous) variables, some variables can take on only a limited set of values, for example high, medium, low (polytomous). These are categorical, and they may be ordered, like age cohorts, or unordered, like puzzle shapes. These non-continuous cases are also referred to as discrete data.
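In R, both kinds of categorical variable are usually stored as factors. A small sketch with made-up values (shape and rating are hypothetical variables, not from any dataset):

```r
# Unordered (nominal) categorical variable
shape <- factor(c("circle", "square", "triangle", "square"))

# Ordered (ordinal) categorical variable; levels are listed
# from lowest to highest
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

is.ordered(shape)   # FALSE
is.ordered(rating)  # TRUE

# Counts per category, in level order
table(rating)

# Comparisons are only meaningful for ordered factors
rating < "high"
```

Modeling functions treat the two differently, so it is worth checking which kind a variable is before fitting anything.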
For a binary Y, the glm() function with the binomial family option is appropriate. For outcomes with more than two categories, there is the proportional odds model for ordered categories, if its assumptions are satisfied, and the generalized logit model.
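For a sense of what the two multi-category models look like in code, here is a sketch that swaps in polr() from {MASS} and multinom() from {nnet}, both of which ship with standard R installations, along with the housing data from {MASS}. These are not the only implementations, and this dataset is used purely for illustration.

```r
library(MASS)  # polr() and the housing data
library(nnet)  # multinom()

# Proportional odds model: Sat is an ordered factor (Low < Medium < High),
# and Freq holds the number of respondents in each cell
po_fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
               data = housing, Hess = TRUE)
summary(po_fit)

# Generalized (multinomial) logit model: makes no use of the ordering
# and estimates a separate set of coefficients per non-reference category
ml_fit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                   data = housing, trace = FALSE)
summary(ml_fit)
```

The proportional odds fit is the more economical of the two, since it shares one set of slopes across the cutpoints, but that economy depends on the proportional odds assumption being reasonable for the data.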
A good set of tools for the categorical data problem is the {rms} package and its related text, along with the {vcd} and {vcdExtra} packages and their accompanying text.
Then there is the Poisson distribution for use with count data, a different set of tools for time-varying data ...
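For count data, the same glm() function can be used with the poisson family. A minimal sketch using the built-in warpbreaks data, where breaks is a count of yarn breaks per loom; the dataset and the choice of predictors are assumptions made for illustration.

```r
# Poisson regression: breaks is a non-negative count
pois_fit <- glm(breaks ~ wool + tension, family = poisson,
                data = warpbreaks)
summary(pois_fit)

# Coefficients are on the log scale; exponentiating gives
# multiplicative effects on the expected count
exp(coef(pois_fit))
```

As with the earlier examples, the fit itself is the easy part; whether the Poisson assumptions suit the data still has to be checked.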