I am trying to build a Supervised Classification based Predictive Model. The data is consists of 13 qualitative variables. I built a predictor based on three columns and now I am trying to apply Logistic regression, SVM against it. I am getting 99% accuracy which doesn't seems right. Do anyone have any suggestions on what I might be doing wrong?
Your question is not quite informative to provide any help. Can you please turn this into a reproducible example? If you don't know how, here's a great link:
I have a data set that has following structure:
|Company|Product|Item|Response|Dispute|Efficiency|
|C1|P1|I1|No|Yes|Good|
|C2|P2|I2|Yes|No|Bad|
|C3|P3|I3|No|No|Bad|
|C4|P4|I4|Yes|Yes|Moderate|
I created the efficiency column based on the Item, Response and Dispute value.
Predicted Efficiency based on rest of the predictors using Logistic regression.
The confusion matrix shows an accuracy as 99%.
This seems a little odd to me.
How balanced is the data? Sometimes a binary classification like logistic regression will yield high accuracy because the data is highly imbalanced between the two classes. For example, if the real world truth of your data is that one class occurs 99% of the time, your model could achieve 99% accuracy by always guessing the same thing.
Yarnabrina is correct in stating that you need to provide some sort of reproducible example if you want the community to be able to give anything more than general statements in response to your question.