Testing multicollinearity in a linear regression with interaction effect

Hey there!
I'm still a newbie and need some help with my linear regression model. Here are the models I've created for my regression.

model_1 <- lm(data=Loan, ApplicantIncome ~ Gender)
model_2 <- lm(data=Loan, ApplicantIncome ~ Married)
model_3 <- lm(data=Loan, ApplicantIncome ~ Married * Gender * Dependents)
model_4 <- lm(data=Loan, ApplicantIncome ~ Gender + Married + Dependents + Education)
As you can see, I included interaction effects in the regression (Married * Gender * Dependents in model_3 expands to the main effects plus all two-way and three-way interactions). Now I'm looking for a way to test my models for multicollinearity. As far as I know, we checked the full model with the vif function, but because of the interaction effect it seems better to check each model on its own. Do you have any suggestions?
Find attached the dataset from Kaggle.
Loan Dataset | Kaggle

Thanks in advance!

Out of curiosity, why do you care about testing for multicollinearity?

It's part of an assignment for my study program, so I assumed it was a necessary part of a linear regression. Isn't it?

Well, if it's part of your assignment you should certainly do it! :grinning:

Concern about multicollinearity is way, way overblown. The purpose of a multiple regression is to handle multicollinear variables and to give the best estimates that can be done. Under the standard assumptions that make a multiple regression valid, multicollinearity is not a problem. That is, one wishes there were not multicollinearity but whatever is in the data is just there. Wishing for no multicollinearity is like wishing for more observations. Would be nice, but not much to be done about it.

(Perfect multicollinearity is a problem. But (a) it doesn't need to be tested for, as the software will either issue a warning or drop a variable, and (b) 99.9 percent of the time it means that the regression is misspecified.)
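
To make point (a) concrete, here is a minimal sketch in base R. The data are made up for illustration (they are not the Loan data): x2 is an exact linear function of x1, so lm() cannot estimate both coefficients and reports one of them as NA.

```r
# Hypothetical toy data: x2 is an exact linear function of x1.
set.seed(1)
x1 <- rnorm(50)
x2 <- 2 * x1 + 3
y  <- 1 + x1 + rnorm(50)

fit <- lm(y ~ x1 + x2)

# The coefficient for x2 is reported as NA because it cannot be
# separated from the intercept and x1.
coef(fit)

# alias() lists the terms that are exact linear combinations of others.
alias(fit)$Complete
```

So there is nothing to test for in this case: the NA in the coefficient table is the signal.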

Okay, that makes sense! I just tried to create a correlation matrix, but there is still a problem because the variable "Dependents" takes the values 0, 1, 2, 3+ and is therefore stored as character, not numeric.
I also asked my sister whether testing for it is needed, and she told me it was part of her study program too when she studied supply chain management.

Use as.numeric() to convert characters to numbers.

But then I have to convert the value 3+ to 3, right? Because the value includes a plus sign.

Sorry. You're right of course that you'll have to do something about the plus sign first.

You may want to make dependents a factor, rather than treating the relation as linear. In that case, use as.factor rather than as.numeric.
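
A small sketch of the two options, using a made-up vector in place of Loan$Dependents (the values and the names dep_factor and dep_numeric are only for illustration):

```r
# Hypothetical values standing in for Loan$Dependents.
dependents <- c("0", "1", "2", "3+", "0", "3+", "1", "2")

# As a factor: "3+" is simply its own category, so there is no need
# to decide which number it stands for.
dep_factor <- factor(dependents, levels = c("0", "1", "2", "3+"))
levels(dep_factor)

# In a model matrix the factor expands to one dummy column per
# non-baseline level.
colnames(model.matrix(~ dep_factor))

# As a number: the plus sign has to be removed first, and "3+" is
# then treated as exactly 3.
dep_numeric <- as.numeric(sub("+", "", dependents, fixed = TRUE))
dep_numeric
```

The factor version lets each category have its own effect; the numeric version assumes each additional dependent changes income by the same amount.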

#Loan$Dependents <- as.factor(Loan$Dependents)
So I used this code to convert Dependents to a factor. But to create a correlation matrix, all variables have to be numeric, because I'm getting the error that x has to be numeric for Gender, Dependents and Married. I checked with the summary function, and the problem seems to be Dependents.
     Gender          Married         Dependents
 Min.   :0.0000   Min.   :0.0000   Length:614
 1st Qu.:1.0000   1st Qu.:0.0000   Class :character
 Median :1.0000   Median :1.0000   Mode  :character
 Mean   :0.8136   Mean   :0.6514
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000
 NA's   :13       NA's   :3

 Min.   :0.0000
 1st Qu.:1.0000
 Median :1.0000
 Mean   :0.7818
 3rd Qu.:1.0000
 Max.   :1.0000

The underlying problem is that "3+" really isn't a number. One doesn't know if there are 3 dependents or 35.

If you want to treat "3+" as 3, or some other number, convert "3+" to "3" and then use as.numeric().

Okay, I did this.

Loan$Dependents[Loan$Dependents == "3+"] <- "3"
Loan$Dependents <- as.numeric(Loan$Dependents)
num [1:614] 0 1 0 0 0 2 0 3 2 1 ...
It seems to be numeric now. But then I added the interactions as new variables.
Loan$Married_Gender <- with(Loan, interaction(Married, Gender, drop = TRUE))
Loan$Married_Dependents <- with(Loan, interaction(Married, Dependents, drop = TRUE))
Loan$Gender_Dependents <- with(Loan, interaction(Gender, Dependents, drop = TRUE))
And I used this code to create the correlation matrix.
cor_matrix <- cor(Loan[, c("ApplicantIncome", "Married", "Gender", "Dependents", "Married_Gender", "Married_Dependents", "Gender_Dependents")])

And I still got this error.

Error in cor(Loan[, c("ApplicantIncome", "Married", "Gender", "Dependents", :
'x' must be numeric

Take a look at your data in the environment window.

Okay, it seems like the interactions "Married_Gender", "Married_Dependents" and "Gender_Dependents" are still stored as factors, so I have to use as.numeric() again, right?

                   ApplicantIncome Married Gender Dependents Married_Gender
ApplicantIncome                  1      NA     NA         NA             NA
Married                         NA       1     NA         NA             NA
Gender                          NA      NA      1         NA             NA
Dependents                      NA      NA     NA          1             NA
Married_Gender                  NA      NA     NA         NA              1
Married_Dependents              NA      NA     NA         NA             NA
Gender_Dependents               NA      NA     NA         NA             NA
                   Married_Dependents Gender_Dependents
ApplicantIncome                    NA                NA
Married                            NA                NA
Gender                             NA                NA
Dependents                         NA                NA
Married_Gender                     NA                NA
Married_Dependents                  1                NA
Gender_Dependents                  NA                 1

Okay, those are the results. Sometimes it is necessary to leave the NAs in the dataset because they carry important information, but given this outcome, would it be better to delete the NAs in this case?
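
On the NA question: by default cor() returns NA for every pair that involves a column with missing values, which is what a matrix like the one above looks like. The rows do not have to be deleted from the dataset; the use argument of cor() controls how missing values are handled. A minimal sketch with made-up data (the data frame df and its columns are hypothetical):

```r
# Hypothetical numeric data with a few missing values.
set.seed(42)
df <- data.frame(
  income  = rnorm(20, mean = 5000, sd = 1500),
  married = rep(c(0, 1), 10),
  gender  = rep(c(1, 1, 0, 1, 1), 4)
)
df$married[c(3, 7)] <- NA
df$gender[5] <- NA

# Default (use = "everything"): NA for every pair that involves a
# column with missing values.
cor(df)

# Only rows without any NA are used.
cor(df, use = "complete.obs")

# For each pair, all rows where both values are present are used.
cor(df, use = "pairwise.complete.obs")
```

Note that lm() drops incomplete rows on its own by default (na.action = na.omit), so the regression itself does not need the NAs removed beforehand.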

Look at the variable Gender.

Also, you might think about what you're hoping to do with a correlation with interactions.
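
One thing that may help here: interaction() builds a factor of category combinations, which is not the same object as the interaction column lm() uses for numeric dummies. A short sketch with made-up 0/1 vectors (the names married, gender and mg are hypothetical stand-ins for the Loan columns):

```r
# Hypothetical 0/1 dummies standing in for Married and Gender.
married <- c(0, 1, 1, 0, 1, 0)
gender  <- c(1, 1, 0, 0, 1, 1)

# interaction() returns a factor whose levels are the observed
# combinations, which is why cor() rejects it.
mg <- interaction(married, gender, drop = TRUE)
class(mg)
levels(mg)

# as.numeric() on a factor returns the internal level codes, which
# are arbitrary labels rather than a meaningful quantity.
as.numeric(mg)

# With numeric dummies, the column lm() builds for married:gender
# is simply the product of the two variables.
married_x_gender <- married * gender
mm <- model.matrix(~ married * gender)
all.equal(unname(mm[, "married:gender"]), married_x_gender)
```

So converting the interaction() factors with as.numeric() would make cor() run, but the resulting numbers would not correspond to the interaction terms in the model.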

There is only one value (1), but if you look at the table there are also zeros. I created dummies earlier because originally it was a category with male and female.

I tried to check for multicollinearity with the correlation matrix because the vif function wasn't working. Are there other possibilities?
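
One possibility that needs no extra package is to compute variance inflation factors by hand: regress each column of the model matrix on all the others and take 1 / (1 - R^2). The sketch below does that in base R. The helper vif_by_column and the toy data are hypothetical and only loosely shaped like the Loan variables; the thread does not say which vif function was used (presumably the one from the car package, which is an assumption).

```r
# Hypothetical helper: VIF for each non-intercept column of the
# model matrix of a fitted lm, computed as 1 / (1 - R^2) from
# regressing that column on all the other columns.
vif_by_column <- function(fit) {
  X <- model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical data, not the Kaggle Loan dataset.
set.seed(123)
n <- 200
toy <- data.frame(
  Married    = rbinom(n, 1, 0.65),
  Gender     = rbinom(n, 1, 0.8),
  Dependents = sample(0:3, n, replace = TRUE)
)
toy$ApplicantIncome <- 5000 + 800 * toy$Married + rnorm(n, sd = 1500)

fit_main <- lm(ApplicantIncome ~ Married + Gender + Dependents, data = toy)
fit_int  <- lm(ApplicantIncome ~ Married * Gender * Dependents, data = toy)

round(vif_by_column(fit_main), 2)

# Interaction columns are products of the main effects, so their
# VIFs are typically much larger than in the main-effects model.
round(vif_by_column(fit_int), 2)
```

High values for the interaction model are expected by construction and do not by themselves indicate a problem with the data.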

When I look at Gender, it's characters.

In any event, looking at a correlation matrix won't necessarily help with multicollinearity because it only tells you about correlations between pairs.
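
A made-up example of how pairwise correlations can miss this: below, one variable is almost exactly the sum of ten independent predictors. Every pairwise correlation is modest, yet the ten predictors together explain nearly all of it. The data and names (X, total) are hypothetical.

```r
set.seed(7)
n <- 500

# Ten independent predictors.
X <- matrix(rnorm(n * 10), ncol = 10)

# A variable that is nearly an exact sum of all ten.
total <- rowSums(X) + rnorm(n, sd = 0.1)

# Each pairwise correlation with `total` is only about 1 / sqrt(10),
# roughly 0.32, so nothing stands out in a correlation matrix.
round(cor(X, total), 2)

# Jointly, the predictors explain almost all the variation in `total`.
r2 <- summary(lm(total ~ X))$r.squared
r2

# The corresponding variance inflation factor is very large.
1 / (1 - r2)
```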
