Regression in R for categorical variables

Elise2992 · February 23, 2023, 7:56pm

Hi,

I am a complete RStudio beginner and for my master's thesis I was asked to conduct a regression.
The problem is that all of my data are categorical and I do not know how to do this.
I tried to convert every variable to a factor, but this always gave an error. I also tried to make a dummy variable for every variable in my data, but the regression results I get are not what I am looking for; I just need the significance per category.

Hopefully someone can help solving this problem.

Thank you in advance!

Elise

technocrat · February 23, 2023, 9:33pm

RStudio, and R itself, are merely tools in statistical analysis. Before applying those tools to data, it is important to articulate the intended outcome clearly. Otherwise, the situation is like bringing a load of building materials to a construction site to see what can be made of them.

Regression is a broad category of statistical methods designed to determine the association between variables. In the simple case of ordinary least squares regression, as implemented by the lm() function, for example, it estimates E(Y|X), the expected value of Y given X. Here's a trivial example:

fit <- lm(mpg ~ ., mtcars)
summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.4506 -1.6044 -0.1196  1.2193  4.6271 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) 12.30337   18.71788   0.657   0.5181  
#> cyl         -0.11144    1.04502  -0.107   0.9161  
#> disp         0.01334    0.01786   0.747   0.4635  
#> hp          -0.02148    0.02177  -0.987   0.3350  
#> drat         0.78711    1.63537   0.481   0.6353  
#> wt          -3.71530    1.89441  -1.961   0.0633 .
#> qsec         0.82104    0.73084   1.123   0.2739  
#> vs           0.31776    2.10451   0.151   0.8814  
#> am           2.52023    2.05665   1.225   0.2340  
#> gear         0.65541    1.49326   0.439   0.6652  
#> carb        -0.19942    0.82875  -0.241   0.8122  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.65 on 21 degrees of freedom
#> Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
#> F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Created on 2023-02-23 with reprex v2.0.2

This models gas mileage, mpg, as a function of all of the other variables in the dataset. Notice that adjusted R^2 appears to indicate a relatively high degree of association, yet none of the individual variables has a p-value below the conventional 0.05 threshold; only wt (0.0633) comes close.

However, fitting a model with only one independent variable shows a different result.

fit <- lm(mpg ~ drat, mtcars)
summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

This time adjusted R^2 is not nearly as "good", but the p-value of the drat variable has shrunk from 0.64 to vanishingly small. What happened?

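For R^2: adding variables never lowers multiple R^2, and adjusted R^2 only partly corrects for that, so the high value in the ten-variable model owes something to the sheer number of predictors. For the change in the p-value: the test statistic for a coefficient measures what that variable contributes after every other variable in the model has been accounted for. Many of the mtcars variables are correlated with one another, so in the full model drat adds little beyond what the others already explain, while on its own it carries the whole association. When the set of variables in the model changes, the fit changes, and so does each coefficient's test.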
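The comparison between the two fits can be checked directly. The following is a minimal sketch using only base R and the built-in mtcars data; the object names (fit_full, fit_drat, drat_p) and the choice of which predictors to correlate with drat are only illustrative.

```r
# Refit both models from the examples above
fit_full <- lm(mpg ~ ., data = mtcars)
fit_drat <- lm(mpg ~ drat, data = mtcars)

# How strongly is drat correlated with some of the other predictors?
round(cor(mtcars$drat, mtcars[, c("cyl", "disp", "wt", "gear", "am")]), 2)

# Multiple R^2 and adjusted R^2 side by side
r2 <- c(full = summary(fit_full)$r.squared,
        drat_only = summary(fit_drat)$r.squared)
adj_r2 <- c(full = summary(fit_full)$adj.r.squared,
            drat_only = summary(fit_drat)$adj.r.squared)
r2
adj_r2

# p-value of drat in each model
drat_p <- c(full = coef(summary(fit_full))["drat", "Pr(>|t|)"],
            drat_only = coef(summary(fit_drat))["drat", "Pr(>|t|)"])
drat_p
```

Printing the correlations alongside the two p-values makes it easier to see that the same variable can look unimportant or important depending on what else is in the model.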

What to make of this? Application of regression, even in the simple case, is not a mechanical process. Attention must be paid to what the test statistics actually represent.

In addition to that pitfall, not all flavors of regression are equally applicable to all combinations of variables. For example, let Y be a binary outcome: yes/no, TRUE/FALSE, 1/0, lives/dies. (Y is the conventional notation for a dependent variable.) For that case, lm() is not appropriate; glm() is needed.

How about categorical variables, such as yours?

They differ from continuous variables, such as mpg, that can in theory take on an infinite range of values. In addition to binary variables (dichotomous) some variables can take on only a set of values—high, medium, low, for example (polytomous). These are categorical, and they may be ordered, like age cohorts, or unordered, like puzzle shapes. These non-continuous cases are also referred to as discrete data.
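In R these distinctions are usually expressed with factors. Here is a small sketch with made-up data (the variable names and values are hypothetical, not taken from the thread) showing an unordered factor, an ordered factor, and the dummy coding that lm() and glm() build from a factor on their own.

```r
# Hypothetical example data, invented for illustration only
shape        <- c("circle", "square", "triangle", "square", "circle")
satisfaction <- c("low", "high", "medium", "low", "high")

# Unordered (nominal) factor: levels default to alphabetical order
shape_f <- factor(shape)

# Ordered factor: levels listed explicitly from lowest to highest
satisfaction_f <- factor(satisfaction,
                         levels = c("low", "medium", "high"),
                         ordered = TRUE)

levels(shape_f)
is.ordered(satisfaction_f)
satisfaction_f < "high"   # comparisons are meaningful for ordered factors

# Treatment (dummy) coding that a model formula generates for shape_f:
# an intercept plus one indicator column per non-reference level
dummies <- model.matrix(~ shape_f)
dummies
```

Because the model functions create these indicator columns themselves, there is normally no need to build dummy variables by hand.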

For a binary Y, the glm() function with the binomial family option is appropriate. For multiple possible categories of outcome, there is the proportional odds model if certain assumptions are satisfied, and the generalized logit model.
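As a rough illustration of the binary case, here is a sketch that uses mtcars again, treating am (0/1) as the outcome and cyl as a categorical predictor. The choice of variables is only for demonstration and says nothing about what model suits your data.

```r
# Treat the number of cylinders as a category rather than a number
cars_f <- transform(mtcars, cyl = factor(cyl))

# Logistic regression: binary outcome, categorical predictor
fit_bin <- glm(am ~ cyl, data = cars_f, family = binomial)

# One row per non-reference category, each compared with the
# reference level (4 cylinders), with its own p-value
bin_coefs <- summary(fit_bin)$coefficients
bin_coefs

# Likelihood-ratio test of cyl as a whole
anova(fit_bin, test = "Chisq")
```

For outcomes with more than two categories, MASS::polr() fits a proportional odds model and nnet::multinom() fits a multinomial logit model; both packages ship with standard R installations. Whether either one fits your situation depends on the assumptions mentioned above.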

A good set of tools for the categorical data problem is the {rms} package and its related text, along with the {vcd} and {vcdExtra} packages and their related text.

Then there is the Poisson distribution for count data, a different set of methods for time-varying data, and so on.
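For count data, a sketch along the same lines, using the built-in warpbreaks dataset (a count outcome, breaks, with two categorical predictors, wool and tension). It only shows the mechanics and makes no claim that a Poisson model is the best fit for these data.

```r
# Poisson regression: count outcome, two categorical predictors
fit_pois <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Coefficients are on the log scale, one per non-reference category
summary(fit_pois)$coefficients

# Exponentiate to read them as multiplicative effects on the expected count
rate_ratios <- exp(coef(fit_pois))
rate_ratios
```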

startz · February 24, 2023, 12:27am

@technocrat gives good advice. Let me add something a little different.

(1) When you say you get an error, show us the code and the error so we might have some idea what caused it.

(2) If you want to ask whether a category as a whole matters, include all the dummy variables for that category and then do an F-test for the hypothesis that all those coefficients equal zero.
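A sketch of what (2) could look like in practice, using mtcars with cyl and gear converted to factors purely for illustration. lm() creates the dummy variables for a factor automatically, so the test compares a model with the factor against one without it.

```r
# Illustrative data: treat cyl and gear as categorical
cars_f <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

fit_with    <- lm(mpg ~ cyl + gear, data = cars_f)
fit_without <- lm(mpg ~ gear, data = cars_f)

# F-test of the hypothesis that all cyl coefficients are zero
ftest <- anova(fit_without, fit_with)
ftest

# The same kind of test for each term in turn
drop1(fit_with, test = "F")

# Per-category results (each level against the reference level)
summary(fit_with)$coefficients
```

The anova() and drop1() tables answer whether the category as a whole matters; the coefficient table gives the significance of each level relative to the reference level.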

Elise2992 · February 24, 2023, 11:34am

Thank you both very much!
Because of confidentiality I cannot give too much information. I am sorry.

The error I get:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

I used code such as this one: Casevoortxt2Engels[is.na(Casevoortxt2Engels) | Casevoortxt2Engels=="Inf"] = NA
To look for possible NA/NaN but I do not know what the exact problem is because I can not find anything.

startz · February 24, 2023, 2:12pm

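This suggests that the left-hand side variable (the 'y' named in the message) contains values that are missing, infinite, or not numeric. The warning about NAs introduced by coercion hints that it may be stored as text rather than as numbers. Take a look at your data to see if this is true.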

Try using lm() instead of lm.fit() as the former does much more error checking.
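One way to look, sketched on a made-up data frame since the real data cannot be shared. The column names (score, group) and the values are hypothetical, and the assumption that the response was read in as text is only a guess based on the coercion warning.

```r
# Hypothetical data: the response was read in as text, with one
# entry that is not a number
d <- data.frame(
  score = c("3.1", "4.5", "n/a", "5.2", "2.8", "4.0"),
  group = factor(c("a", "b", "a", "b", "c", "c")),
  stringsAsFactors = FALSE
)

# Check how each column is stored; score shows up as character
sapply(d, class)

# Convert to numeric and list the entries that fail to convert
score_num <- suppressWarnings(as.numeric(d$score))
bad_values <- d$score[is.na(score_num) & !is.na(d$score)]
bad_values

# Refit with a numeric response; rows with NA are dropped by default
d$score <- score_num
fit <- lm(score ~ group, data = d)
nobs(fit)
```

Running sapply(your_data, class) on the real data should show quickly whether the response is numeric, and is.finite() can be applied to a numeric column to find NA, NaN, or Inf entries.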
