How to run categorical dummy variables in R?

JoeyBarnes20 · December 1, 2018, 1:50am

Hello, first I would like to apologize if I am asking this in the wrong category or if this question has been posted before. I'm a college student and we need to do our econometrics term paper using R. However, our professor never taught us how we can run a regression for categorical variables...

The trouble I am having is that, to avoid perfect multicollinearity, you need to include only n-1 dummy variables for n categories. How would I do this? For example, I have 5 categories and I want R to include only 4 in the regression and use the excluded one as the base group. This is how my data is set up

I am analyzing the impact of the height of NBA players on their salary while controlling for position. I want shooting guard (SG) to be my reference group.
So my regression formula is:

Reg2 <- lm(SALARY~ Height+PG+PF+SF+Center)
summary(Reg2)

Call:
lm(formula = SALARY ~ Height + PG + PF + SF + Center)

Residuals:
     Min       1Q   Median       3Q      Max
-4845982 -2412751  -346673  2224250  7143761

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 82417024   27550269   2.992  0.00514 **
Height       -767197     351353  -2.184  0.03599 *
PG           6282111    2304350   2.726  0.01005 *
PF           5353895    2482486   2.157  0.03819 *
SF           5428714    2335014   2.325  0.02618 *
Center       6404938    3096554   2.068  0.04628 *

BrentHutto · December 1, 2018, 11:54am

It seems to me the model you came up with will work just as you wanted.

Each of the coefficients for PG, PF, SF and Center would be interpreted as the difference in expected salary for that position relative to an SG (your reference group), holding height constant.
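If your positions were stored in a single column instead of separate 0/1 columns, R could build the n-1 dummies for you. Here is a minimal sketch of that approach. Everything in it is an assumption for illustration: the data frame `nba`, the `Position` column, and the simulated numbers are hypothetical and are not your NBA data.

```r
# Simulated data for illustration only; not the poster's NBA data.
set.seed(1)
n <- 200
position <- sample(c("SG", "PG", "PF", "SF", "Center"), n, replace = TRUE)
height   <- round(rnorm(n, mean = 79, sd = 3))

# Hypothetical position effects, used only to generate a fake salary.
pos_effect <- c(SG = 0, PG = 1e6, PF = 5e5, SF = 7e5, Center = 2e6)
salary <- 5e6 + 1e5 * height + unname(pos_effect[position]) +
  rnorm(n, sd = 2e6)

nba <- data.frame(SALARY = salary, Height = height, Position = position,
                  stringsAsFactors = FALSE)

# Store position as a factor and make SG the reference (base) level.
nba$Position <- relevel(factor(nba$Position), ref = "SG")

# lm() expands the factor into 4 dummies and omits the SG level.
fit_factor <- lm(SALARY ~ Height + Position, data = nba)

# The same model with hand-built 0/1 dummies, as in the original post.
nba$PG     <- as.numeric(nba$Position == "PG")
nba$PF     <- as.numeric(nba$Position == "PF")
nba$SF     <- as.numeric(nba$Position == "SF")
nba$Center <- as.numeric(nba$Position == "Center")
fit_dummy <- lm(SALARY ~ Height + PG + PF + SF + Center, data = nba)

# Match coefficients by name and check that the two fits agree.
dummies <- c("PG", "PF", "SF", "Center")
same_fit <- all.equal(
  unname(coef(fit_factor)[paste0("Position", dummies)]),
  unname(coef(fit_dummy)[dummies])
)
print(same_fit)
```

In the factor version the coefficient named `PositionPG` plays the same role as `PG` in your output, and likewise for the other positions. Changing `ref` in `relevel()` switches the base group without rebuilding any columns.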

It may be beyond the scope of your assignment, but generally speaking, variables like "salary" or "income" often cause problems because their distributions are highly skewed. So check that the model's assumptions hold in that respect.
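One way to look into this is to compare the residuals from a model in salary levels with those from a model in log salary. The sketch below is only an illustration on simulated, deliberately right-skewed data; the variable names, the numbers, and the helper `skew()` are made up here, and taking logs is one common option rather than something your assignment necessarily requires.

```r
# Simulated, right-skewed salaries for illustration only.
set.seed(2)
n <- 200
Height <- rnorm(n, mean = 79, sd = 3)
SALARY <- exp(15 + 0.02 * Height + rnorm(n, sd = 0.8))

# Model in levels versus model in logs.
fit_level <- lm(SALARY ~ Height)
fit_log   <- lm(log(SALARY) ~ Height)

# Hypothetical helper: a simple sample skewness (0 for a symmetric sample).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_level <- skew(resid(fit_level))
skew_log   <- skew(resid(fit_log))
print(c(level = skew_level, log = skew_log))

# Visual checks you could run interactively:
# hist(SALARY)
# qqnorm(resid(fit_level)); qqline(resid(fit_level))
# qqnorm(resid(fit_log));   qqline(resid(fit_log))
```

If the log model looks better behaved, keep in mind that its coefficients are then read as approximate proportional changes in salary rather than dollar amounts.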

But your actual coding of the model seems good!
