Hey there!
I'm still a newbie and need some help with my linear regression model. Here are the models I've created for my regression.
model_1 <- lm(data=Loan, ApplicantIncome ~ Gender)
model_2 <- lm(data=Loan, ApplicantIncome ~ Married)
model_3 <- lm(data=Loan, ApplicantIncome ~ Married * Gender * Dependents)
model_4 <- lm(data=Loan, ApplicantIncome ~ Gender + Married + Dependents + Education)
As you can see, I included interaction effects in the regression (model_3). Now I'm looking for a method to test my models for multicollinearity. As far as I know, we checked the full model with the vif function, but because of the interaction effects it seems better to check each model on its own. Do you have any suggestions?
The dataset is from Kaggle: Loan Dataset | Kaggle
Well, if it's part of your assignment you should certainly do it!
Concern about multicollinearity is way, way overblown. The purpose of a multiple regression is to handle multicollinear variables and to give the best estimates that can be done. Under the standard assumptions that make a multiple regression valid, multicollinearity is not a problem. That is, one wishes there were not multicollinearity but whatever is in the data is just there. Wishing for no multicollinearity is like wishing for more observations. Would be nice, but not much to be done about it.
(Perfect multicollinearity is a problem. But (a) it doesn't need to be tested for, as the software will either issue a warning or drop a variable, and (b) 99.9 percent of the time it means that the regression is misspecified.)
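To illustrate point (a), here is a minimal sketch of how lm() reacts to perfect multicollinearity. The data are made up for the example (x1, x2, x3 and y are hypothetical and have nothing to do with the Loan dataset); x3 is built as an exact linear combination of the other two predictors.

```r
# Toy data (hypothetical, not the Kaggle Loan data)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                      # exact linear combination of x1 and x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# lm() cannot estimate x3 separately, so its coefficient comes back as NA
print(coef(fit))

# alias() reports which terms are exact linear combinations of the others
print(alias(fit)$Complete)
```

An NA coefficient in the output is the "dropped variable" signal, so no separate test is needed for this case.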
Okay, that makes sense! I just tried to create a correlation matrix, but there is still a problem because the variable "Dependents" is coded as 0, 1, 2, 3+ and is therefore stored as character, not as numeric.
I also asked my sister whether it needs to be tested, and she told me that it was part of her study program when she studied supply chain management.
#Loan$Dependents <- as.factor(Loan$Dependents)
So I used this code to convert Dependents to a factor. But to create a correlation matrix, all variables have to be numeric, because I'm getting the error that x has to be numeric for Gender, Dependents and Married. I checked it with the summary function and the problem should be Dependents.
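A quick way to see which columns cor() will reject is to test each column with is.numeric(). This is only a sketch on a small made-up data frame (Loan_toy and its values are hypothetical stand-ins for the real Loan data).

```r
# Toy stand-in for the Loan data (hypothetical values)
Loan_toy <- data.frame(
  ApplicantIncome = c(5849, 4583, 3000, 2583, 6000),
  Gender          = c(1, 1, 1, 0, NA),
  Married         = c(0, 1, 1, 1, 0),
  Dependents      = c("0", "1", "0", "3+", "2"),
  stringsAsFactors = FALSE
)

# TRUE for columns cor() accepts, FALSE for the ones that trigger the error
is_num <- sapply(Loan_toy, is.numeric)
print(is_num)

# Names of the columns that still need converting
non_numeric <- names(is_num)[!is_num]
print(non_numeric)
```

On the real data the same two lines, applied to Loan, should list every column that is still character or factor.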
Gender Married Dependents
Min. :0.0000 Min. :0.0000 Length:614
1st Qu.:1.0000 1st Qu.:0.0000 Class :character
Median :1.0000 Median :1.0000 Mode :character
Mean :0.8136 Mean :0.6514
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
NA's :13 NA's :3
Education
Min. :0.0000
1st Qu.:1.0000
Median :1.0000
Mean :0.7818
3rd Qu.:1.0000
Max. :1.0000
Loan$Dependents[Loan$Dependents == "3+"] <- "3"
Loan$Dependents <- as.numeric(Loan$Dependents)
str(Loan$Dependents)
num [1:614] 0 1 0 0 0 2 0 3 2 1 ...
And it seems to be numeric now. Then I added the interactions as new variables.
Loan$Married_Gender <- with(Loan, interaction(Married, Gender, drop = TRUE))
Loan$Married_Dependents <- with(Loan, interaction(Married, Dependents, drop = TRUE))
Loan$Gender_Dependents <- with(Loan, interaction(Gender, Dependents, drop = TRUE))
And used the code to create the cor matrix.
cor_matrix <- cor(Loan[, c("ApplicantIncome", "Married", "Gender", "Dependents", "Married_Gender", "Married_Dependents", "Gender_Dependents")])
And I still got this error.
Error in cor(Loan[, c("ApplicantIncome", "Married", "Gender", "Dependents", :
'x' must be numeric
Okay, it seems like the interactions "Married_Gender", "Married_Dependents" and "Gender_Dependents" are still stored as factors, so I have to use as.numeric again, right?
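One caveat here: interaction() always returns a factor, and as.numeric() on a factor returns the internal level codes rather than anything resembling a product term. A possible alternative, sketched below on made-up data (Loan_toy is hypothetical), is to let model.matrix() build the numeric columns that lm() itself would use for the formula.

```r
# Toy stand-in for the Loan data (hypothetical values)
Loan_toy <- data.frame(
  Married    = c(0, 1, 1, 1, 0, 1),
  Gender     = c(1, 1, 0, 1, 0, 1),
  Dependents = c(0, 1, 0, 3, 2, 2)
)

# interaction() returns a factor, which cor() rejects
f <- with(Loan_toy, interaction(Married, Gender, drop = TRUE))
print(class(f))

# as.numeric() on a factor gives level codes, not a product of the dummies
print(as.numeric(f))

# model.matrix() builds the numeric design columns for the formula
X <- model.matrix(~ Married * Gender * Dependents, data = Loan_toy)
X <- X[, colnames(X) != "(Intercept)"]
print(colnames(X))

# The two-way interaction column is simply the product of the two dummies
print(all(X[, "Married:Gender"] == Loan_toy$Married * Loan_toy$Gender))
```

The resulting matrix X is fully numeric, so it can be passed to cor() directly.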
ApplicantIncome Married Gender Dependents Married_Gender
ApplicantIncome 1 NA NA NA NA
Married NA 1 NA NA NA
Gender NA NA 1 NA NA
Dependents NA NA NA 1 NA
Married_Gender NA NA NA NA 1
Married_Dependents NA NA NA NA NA
Gender_Dependents NA NA NA NA NA
Married_Dependents Gender_Dependents
ApplicantIncome NA NA
Married NA NA
Gender NA NA
Dependents NA NA
Married_Gender NA NA
Married_Dependents 1 NA
Gender_Dependents NA 1
Okay, those are the results. Sometimes it is necessary to leave the NAs in the dataset because they carry important information, but given this outcome, would it be better to delete the NAs in this case?
There is only one value (1), but if you take a look at the table there are also zeros. I defined dummies beforehand because originally it was a category with male and female.
I tried to check for multicollinearity with the correlation matrix because the vif function wasn't working. Are there other possibilities?
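The NA entries come from cor() itself: by default, a single missing value in a column makes its correlations with the other columns NA. The rows do not have to be deleted from the dataset; the `use` argument of cor() can skip them for the calculation only. A small sketch on made-up data (Loan_toy is hypothetical):

```r
# Toy stand-in for the Loan data (hypothetical values, with some NAs)
Loan_toy <- data.frame(
  ApplicantIncome = c(5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036),
  Married         = c(0, 1, 1, 1, 0, 1, NA, 1),
  Gender          = c(1, 1, NA, 1, 0, 1, 1, 0)
)

# Default behaviour: off-diagonal entries involving a column with NAs are NA
cor_default <- cor(Loan_toy)
print(cor_default)

# Use only the rows without any NA for the calculation
cor_complete <- cor(Loan_toy, use = "complete.obs")
print(round(cor_complete, 2))
```

For comparison, lm() already drops incomplete rows by default (na.action = na.omit), so "complete.obs" matches what the regression sees when the same variables are used.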
In any event, looking at a correlation matrix won't necessarily help with multicollinearity because it only tells you about correlations between pairs.
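A rough illustration of that point: the variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors, so it picks up joint dependence that pairwise correlations can miss. The sketch below computes it by hand in base R on made-up data; vif_manual is a hypothetical helper written for this example, not a function from any package.

```r
# Toy data (hypothetical): x3 depends on x1 and x2 jointly
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing
# design column j on all the other design columns
vif_manual <- function(model) {
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

print(round(cor(cbind(x1, x2, x3)), 2))  # pairwise correlations
print(round(vif_manual(fit), 1))         # joint dependence per predictor
```

Because the helper works on the model matrix, it can be called on each fitted model separately. With factor predictors or interaction terms it returns one value per design column rather than one per term.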