Logistic regression failed - which other model suits this case?

rajeshsao · June 3, 2018, 9:36am

I got a case study on a banking dataset to identify loan defaulters. I tried to use a logistic regression model to draw inferences from the data, but it works only with a binary dependent variable. Can anyone help me find out which model suits this case study? I am sharing a glimpse of the dataset for reference.

The only categorical variable is status, which takes the values A, B, C and D. I tried modelling status using cbind and got a result. After that, when I used prediction and AUC, I failed to get any output. It throws an error saying that only binary values (0<y<1) can be used to run the code.

Please help me find out which other model suits this kind of question and, if possible, share an example so I can understand it better.

Leon · June 3, 2018, 12:30pm

It is much easier to help you if you supply us with a reprex.

If you would like to use logistic regression for multi-class classification, then you could use the one-vs-all approach, e.g. A vs. B, C, D, i.e. recode the outcome as "group A: yes/no".
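To make the "group A: yes/no" idea concrete, here is a minimal sketch in base R. The data frame `toy` and its columns `amount`, `duration` and `is_A` are made-up stand-ins, not the loan data from this thread, so treat it as an illustration of the recoding step only.

```r
# Hypothetical toy data: 'status' has four classes, like the loan data in this thread.
set.seed(42)
n <- 200
toy <- data.frame(
  amount   = runif(n, 1000, 50000),
  duration = sample(c(12, 24, 36, 48, 60), n, replace = TRUE),
  status   = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)

# One-vs-all recoding: 1 if the row is in class A, 0 for B, C or D.
toy$is_A <- as.integer(toy$status == "A")

# Ordinary binary logistic regression on the recoded outcome.
fit_A <- glm(is_A ~ amount + duration, family = binomial, data = toy)

# Predicted probability of class A for every row.
p_A <- predict(fit_A, type = "response")
head(p_A)
```

Because `is_A` only contains 0 and 1, `glm()` with `family = binomial` accepts it without complaining about the range of y.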

rajeshsao · June 3, 2018, 5:39pm

Hypothesis -
The Loans Division of the bank wants to know which accounts are likely to default on repaying their loans when the contract ends.

Execution problem -

I am getting the following error while running the confusion matrix (Error: data and reference should be factors with the same levels.). Please check and help me with this.

loan <- read.csv("C:/Users/sao/Downloads/banking_data/Banking_Data/loan.txt", sep = ";")
trans <- read.csv("C:/Users/sao/Downloads/banking_data/Banking_Data/trans.txt", sep = ";")
trans <- subset(trans, select = c(account_id,balance,k_symbol))

loanaccount <- merge(trans, loan, by="account_id")
loanaccount <- subset(loanaccount,select = -c(loan_id))

## check for missing values
is.na(loanaccount)
which(is.na(loanaccount))

## duplicated values
library(dplyr)  # distinct() comes from dplyr
unique(loanaccount)
distinct(loanaccount)

## create training and test data
install.packages("DMwR")
library(DMwR)
str(loanaccount)

##data split
datasplit <- sample(nrow(loanaccount), round(nrow(loanaccount)*0.8))
trainigdata <- loanaccount[datasplit,]
testdata <- loanaccount[-datasplit,]
unique(trainigdata)


## loan amount distribution and box plot
library(ggplot2)

give_count <- stat_summary(fun.data = function(x) return(c(y = median(x) * 1.06, label = length(x))),
                           geom = "text")

give_mean <- 
  stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
               shape = 18, size = 3, show.legend = FALSE)

ggplot(trainigdata, aes(x = k_symbol, y = amount)) +
  geom_boxplot(outlier.colour = "black", outlier.shape = 16, outlier.size = 2, notch = FALSE) +
  give_count +
  give_mean +
  scale_y_continuous(labels = scales::comma) +  # comma() lives in the scales package
  labs(title = "Loan Amount by status", x = "loan purpose", y = "Loan Amount \n")
 


## summary on training dataset
summary(trainigdata)
summary(trainigdata$status)
summary(trainigdata$k_symbol)

## t-test result
library(graphics)  # base package that ships with R; no install needed
install.packages("pwr")
library(pwr)
install.packages("nparcomp")
library(nparcomp)
t.test(trainigdata$amount, testdata$amount)

t.test(trainigdata$amount, loanaccount$amount)

## making tree model from train data
install.packages("tree")
library(tree)
train.loan <- tree(status ~ . - duration - date - payments - account_id, trainigdata)
plot(train.loan)
text(train.loan, pretty = 0)
summary(train.loan)

## tree prediction on the held-out test data
treeloanprediction <- predict(train.loan, testdata, type = "class")


##logistic regression

lmloan <- glm(cbind(account_id,status)~.-payments,family="binomial", trainigdata)

summary(lmloan)$coeff
plot(lmloan)

##predict

predictlm <- predict(lmloan,newdata = testdata, type="response")
predictlm
## confusion matrix: sensitivity, specificity
library(heuristica)
library(caret)
library(ROCR)
library(stringi)
model_glm <- predict.glm(lmloan, testdata, type = "response", na.action = na.pass)
model_predict <- function(pred, t) ifelse (pred>t, TRUE, FALSE)
testdata <- testdata[complete.cases(testdata),]
caret::confusionMatrix(model_predict(model_glm, 0.5), reference = testdata, positive="TRUE")


## test set area under the curve
library(ROCR)

rocrpred <- prediction(model_glm, trainigdata$status)

pred <- prediction(predicttestdata,testdata$status)

as.numeric(performance(pred, "auc")@y.values)

Leon · June 3, 2018, 6:46pm

First things first... A factor variable is a categorical variable. The levels of a factor variable are the possible categories that the value of the variable can fall into. E.g.:

> factor(sample(LETTERS, 10), levels = LETTERS)
 [1] M Y D B Z I G R T Q
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Note how there are additional levels beyond the values the variable actually takes. Your error presumably has to do with you comparing factor variables with different levels.

You can get the levels of a factor variable, like so:

> my_factor_var = factor(sample(LETTERS, 10), levels = LETTERS)
> my_factor_var
 [1] I V H X M L Z W Y E
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> levels(my_factor_var)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

You can set the levels of a factor variable like so:

> factor(my_factor_var, levels = unique(my_factor_var))
 [1] I V H X M L Z W Y E
Levels: I V H X M L Z W Y E

So, bottom line: try to look into factor variables and then check your confusion matrix again.
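As a rough illustration of the level mismatch, here is a small base R sketch. The vectors `truth` and `pred` are invented for the example and are not taken from the loan data; the point is only that both inputs need to be factors over the same set of levels before they can be compared.

```r
# Hypothetical reference labels and predictions (not the poster's data).
truth <- factor(c("A", "B", "A", "D", "C", "A"))  # levels: A B C D
pred  <- factor(c("A", "A", "A", "D", "A", "B"))  # levels: A B D (no "C")

# The level sets differ, which is the situation the error message describes.
identical(levels(pred), levels(truth))

# Re-create the predictions using the level set of the reference.
pred_aligned <- factor(pred, levels = levels(truth))
identical(levels(pred_aligned), levels(truth))

# Both vectors are now comparable, e.g. in a plain cross-tabulation.
conf_tab <- table(predicted = pred_aligned, actual = truth)
conf_tab
```

In the code posted above, the predictions are TRUE/FALSE values and the reference is the whole `testdata` data frame, so neither is a factor sharing levels with the other; a single column such as `testdata$status` (recoded to the same categories as the predictions) is presumably what is needed there.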

rajeshsao · June 4, 2018, 6:04am

But in logistic regression only a binary outcome (1, 0) will work. Here the status variable is made up of four levels: A, B, C and D. So I am not sure about this model. Could you please help me select a suitable model to identify loan defaulters in this case? Or will this composite status variable work with logistic regression?

Leon · June 4, 2018, 6:40pm

I cannot tell you which model to use, you will have to look at your data and the question you want to ask your data. From what you've written it seems like you have to look into multi-class classification... As I wrote earlier you can try using the one-vs-all approach with logistic regression.
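For what it's worth, the full one-vs-all procedure could look roughly like the sketch below: fit one binary logistic regression per class, then assign each row to the class with the highest predicted probability. The data (`toy`, `x1`, `x2`, `score`) are simulated purely for illustration, and this is only one possible way to set it up, not a recommendation for the loan data specifically.

```r
# Simulated data with a four-class outcome (hypothetical, for illustration only).
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
score <- toy$x1 + 0.5 * toy$x2 + rnorm(n)
toy$status <- cut(score, breaks = quantile(score, 0:4 / 4),
                  labels = c("A", "B", "C", "D"), include.lowest = TRUE)

classes <- levels(toy$status)

# Fit one binary logistic regression per class (class k vs. all others)
# and collect each model's predicted probability as a matrix column.
probs <- sapply(classes, function(k) {
  fit <- glm(as.integer(status == k) ~ x1 + x2, family = binomial, data = toy)
  predict(fit, newdata = toy, type = "response")
})

# Assign each row to the class whose model gives the highest probability.
pred <- factor(classes[max.col(probs, ties.method = "first")], levels = classes)

# Cross-tabulate predictions against the true classes (in-sample).
table(predicted = pred, actual = toy$status)
mean(pred == toy$status)
```

Note that `pred` is built with the same levels as `toy$status`, so the two can be compared directly. For an honest performance estimate the models would be fit on the training split and evaluated on the test split rather than in-sample as done here.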