High accuracy seems fishy

Shahna11 · March 16, 2019, 4:55pm

I am trying to build a supervised classification model. The data consists of 13 qualitative variables. I built a target column based on three of the columns, and now I am trying to apply logistic regression and SVM to predict it. I am getting 99% accuracy, which doesn't seem right. Does anyone have any suggestions on what I might be doing wrong?

Thanks.

Yarnabrina · March 16, 2019, 5:21pm

Welcome to the community, Shahna.

Your question doesn't contain enough information for us to help. Can you please turn it into a reproducible example? If you don't know how, here's a great link:

FAQ: How to do a minimal reproducible example (reprex) for beginners (Guides & FAQs)

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue.
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, including the necessary information on the packages used.

Let's quickly go over each one of these with examples.

Minimal Dataset (Sample Data): You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame: head(iris) …

Shahna11 · March 16, 2019, 5:45pm

I have a data set with the following structure:

|Company|Product|Item|Response|Dispute|Efficiency|
|---|---|---|---|---|---|
|C1|P1|I1|No|Yes|Good|
|C2|P2|I2|Yes|No|Bad|
|C3|P3|I3|No|No|Bad|
|C4|P4|I4|Yes|Yes|Moderate|

I created the Efficiency column based on the Item, Response and Dispute values.
I then predicted Efficiency from the rest of the columns using logistic regression.
The confusion matrix shows an accuracy of 99%.
This seems a little odd to me.

jasonparker · March 16, 2019, 7:07pm

How balanced is the data? Sometimes a binary classifier such as logistic regression will yield high accuracy simply because the data is highly imbalanced between the two classes. For example, if one class makes up 99% of your data, your model could achieve 99% accuracy by always predicting that class.
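One quick way to check this is to compare the model's accuracy with the accuracy of always predicting the most common class. A minimal sketch in base R, using made-up labels (the 99/1 split and the vector name `y` are assumptions for illustration, not the poster's data):

```r
# Hypothetical labels: 99 of one class and 1 of the other (illustrative only)
y <- factor(c(rep("No", 99), "Yes"))

# Share of each class in the data
class_share <- prop.table(table(y))
print(class_share)

# Accuracy of a "model" that always predicts the most common class
majority_class <- names(which.max(class_share))
baseline_accuracy <- mean(y == majority_class)
print(baseline_accuracy)
```

If the model's accuracy is not clearly above this baseline, the headline number says little about how well it separates the classes.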

Yarnabrina is correct in stating that you need to provide some sort of reproducible example if you want the community to be able to give anything more than general statements in response to your question.

Shahna11 · March 16, 2019, 8:46pm

So here is the result of running the SVM algorithm:

library(e1071)
# Fit a radial-kernel SVM classifier; e1071 names these arguments `type` and `kernel`
svm1 <- svm(Efficiency ~ ., data = train,
            type = "C-classification", kernel = "radial",
            gamma = 0.1, cost = 10)
summary(svm1)
#--------
Call:
svm(formula = Efficiency ~ ., data = train, type = "C-classification",
    kernel = "radial", gamma = 0.1, cost = 10)

Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
gamma: 0.1

Number of Support Vectors: 61

( 39 21 1 )

Number of Classes: 3

Levels:
Bad Good Moderate
#------------
# Predict on the training data itself and cross-tabulate actual vs. predicted
prediction <- predict(svm1, train)
xtab <- table(train$Efficiency, prediction)
xtab

#------------------
          prediction
           Bad Good Moderate
  Bad        1    0        0
  Good       0   48        0
  Moderate   0    0       21
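
Note that `predict(svm1, train)` scores the model on the same rows it was fitted on, so the table above measures fit to the training data rather than predictive accuracy. A rough sketch of a held-out check follows. The simulated data frame `dat`, the rule used to derive `Efficiency`, the 70/30 split, and the seed are all assumptions for illustration, and the `e1071` step is skipped if the package is not installed.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the real data (hypothetical values)
n <- 200
dat <- data.frame(
  Response = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Dispute  = factor(sample(c("Yes", "No"), n, replace = TRUE))
)

# Hypothetical rule deriving the target from the other columns
dat$Efficiency <- factor(
  ifelse(dat$Response == "Yes" & dat$Dispute == "No", "Good",
         ifelse(dat$Response == "No" & dat$Dispute == "Yes", "Bad", "Moderate"))
)

# Hold out 30% of the rows for testing
test_idx <- sample(seq_len(n), size = round(0.3 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::svm(Efficiency ~ ., data = train,
                    type = "C-classification", kernel = "radial",
                    gamma = 0.1, cost = 10)
  pred <- predict(fit, test)
  # Confusion matrix and accuracy on rows the model has not seen
  print(table(actual = test$Efficiency, prediction = pred))
  print(mean(pred == test$Efficiency))
}
```

When the target is a deterministic function of the predictors, as in this made-up rule, held-out accuracy can also come out very high, so a high score by itself may simply reflect how the target was constructed.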

Shahna11 · March 16, 2019, 8:48pm

Thank you for your reply, Jason.
What would you suggest as the right approach for dealing with highly imbalanced data?
