Multinomial logistic regression - bad accuracy

pinkfuchs · November 6, 2019, 12:31am

Hello everyone!
I have to do a multinomial logistic regression for my bachelor thesis, since I have a multinomial dependent variable and several metric independent variables.
I used the following script, but my accuracy is 56.4%, and this is really low, right?
(I'm an absolute RStudio beginner.)
Can I still use the data, or is there a way to improve the accuracy?
I already checked for multicollinearity and deleted the outliers (before I deleted them I had higher accuracy; should I leave them in?).
I followed this site:
https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/
I'd be really happy for any advice!

Yours sincerely, Lea

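library(dplyr)  # provides sample_frac()
tabelle <- read.csv("tabelleohneausreißer.csv", sep = ";")
fix(tabelle)
train <- sample_frac(tabelle, 0.7)
sample_id <- as.numeric(rownames(train))
test <- tabelle[-sample_id, ]
train$a <- relevel(train$a, ref = "Vegetation")
require(nnet)
# f is included to match the Call shown by summary() and the 10 estimated weights
multinom.fit <- multinom(a ~ b + c + d + e + f - 1, data = train)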

# weights:  18 (10 variable)

initial value 274.653072
iter 10 value 220.480331
final value 220.404239
converged

summary (multinom.fit)
Call:
multinom(formula = a ~ b + c + d + e + f - 1,
data = train)

Coefficients:
                    b           c          d         e            f
Merkmala -0.002147572 -0.04805716 -0.9840262 -1.150613  0.001257216
Merkmalb  0.104333462  0.04476998 -2.1787026  1.591302 -0.004514558

Std. Errors:
                  b          c         d         e            f
Merkmala 0.02384957 0.03294757 0.8306937 0.9734560 0.0009642184
Merkmalb 0.03884706 0.05458862 0.1015885 0.2658659 0.0014501817

Residual Deviance: 440.8085
AIC: 460.8085

exp(coef(multinom.fit))
b c d e f
Merkmala 0.9978547 0.9530793 0.3738031 0.3164429 1.0012580
Merkmalb 1.1099705 1.0457873 0.1131883 4.9101396 0.9954956
head (probability.table <- fitted(multinom.fit))
Vegetation Merkmala Merkmalb
1 0.6068753 0.3085744 0.08455033
2 0.4620414 0.3012380 0.23672058
3 0.6740011 0.2457066 0.08029228
4 0.6229242 0.2467514 0.13032444
5 0.5430746 0.4423710 0.01455433
6 0.6343817 0.3119502 0.05366813
train$precticed <- predict(multinom.fit,newdata=train,"class")
ctable <- table(train$a, train$precticed)
round((sum(diag(ctable))/sum(ctable))*100,2)
[1] 56.4
test$precticed <- predict(multinom.fit,newdata=test,"class")
ctable <- table (test$a,test$precticed)
round((sum(diag(ctable))/sum(ctable))*100,2)
[1] 42.99
min (Inf)

technocrat · November 6, 2019, 6:36am

Hi, and welcome!

There's a convention here about questions that are part of degree requirements, referred to as homework for short, and another about reproducible examples, called a reprex. Following both will help you get better answers.

So, 1) understand that we will help explain and point you in the right direction, 2) we may provide code for specific portions of the problem, but won't give you a complete solution, and 3) it really helps to have a reprex that includes the data or a representative extract, rather than

tabelle<- read.csv("tabelleohneausreißer.csv",sep=";")

which is nowhere to be found.
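
As a rough illustration of what a "representative extract" can look like in a post, here is a minimal sketch using dput(). The name tabelle_demo and the built-in iris data are stand-ins chosen for this example (assumptions), since the original CSV is not available.

```r
# Stand-in for the poster's data frame; iris ships with base R (assumption for illustration).
tabelle_demo <- head(iris, 5)

# dput() prints R code that recreates the object, so it can be pasted into a post as plain text.
dput(tabelle_demo)

# Rebuild the object from that text, as a reader of the post would after copying it.
pasted_text <- paste(deparse(tabelle_demo), collapse = "\n")
rebuilt <- eval(parse(text = pasted_text))
```

A reader can then run the pasted text to get the same rows without needing the file, and the reprex package's reprex() function can render the code together with its output.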

So, let's start from the basics.

Your problem involves a multinomial dependent variable, and you are applying a script at the link that states

Your dependent variable must be Nominal . This does not mean that multinomial regression cannot be used for the ordinal variable. However, for multinomial regression, we need to run ordinal logistic regression.

The first step in getting back on track is reproducing the example you are working from

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(rattle.data)
data(wine)
train <- sample_frac(wine, 0.7)
sample_id <- as.numeric(rownames(train)) # rownames() returns character so as.numeric
test <- wine[-sample_id,]
head(test) # shortened from example
#>     Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
#> 126    2   12.07  2.16 2.17       21.0        85    2.60       2.65
#> 127    2   12.43  1.53 2.29       21.5        86    2.74       3.15
#> 128    2   11.79  2.13 2.78       28.5        92    2.13       2.24
#> 129    2   12.37  1.63 2.30       24.5        88    2.22       2.45
#> 130    2   12.04  4.30 2.38       22.0        80    2.10       1.75
#> 131    3   12.86  1.35 2.32       18.0       122    1.51       1.25
#>     Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
#> 126          0.37            1.35  2.76 0.86     3.28     378
#> 127          0.39            1.77  3.94 0.69     2.84     352
#> 128          0.58            1.76  3.00 0.97     2.44     466
#> 129          0.40            1.90  2.12 0.89     2.78     342
#> 130          0.42            1.35  2.60 0.79     2.57     580
#> 131          0.21            0.94  4.10 0.76     1.29     630
train <- sample_frac(wine, 0.7)
sample_id <- as.numeric(rownames(train)) # rownames() returns character so as.numeric
test <- wine[-sample_id,]
train$Type <- relevel(train$Type, ref = "3")
require(nnet)
#> Loading required package: nnet
multinom.fit <- multinom(Type ~ Alcohol + Color -1, data = train)
#> # weights:  9 (4 variable)
#> initial  value 137.326536 
#> iter  10 value 78.435565
#> final  value 78.365718 
#> converged
exp(coef(multinom.fit))
#>    Alcohol      Color
#> 1 1.482424 0.44367306
#> 2 2.541586 0.08372892
head(probability.table <- fitted(multinom.fit))
#>             3         1            2
#> 1 0.006129304 0.0905185 0.9033521930
#> 2 0.047171515 0.3049780 0.6478504747
#> 3 0.836364425 0.1634788 0.0001567621
#> 4 0.332745275 0.6129917 0.0542630372
#> 5 0.598866043 0.3970107 0.0041232804
#> 6 0.025020422 0.2518637 0.7231158655
train$precticed <- predict(multinom.fit, newdata = train, "class")
ctable <- table(train$Type, train$precticed)
round((sum(diag(ctable))/sum(ctable))*100,2)
#> [1] 71.2
test$precticed <- predict(multinom.fit, newdata = test, "class")
ctable <- table(test$Type, test$precticed)
round((sum(diag(ctable))/sum(ctable))*100,2)
#> [1] 9.43

Created on 2019-11-05 by the reprex package (v0.3.0)
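
The split in the code above recovers the held-out rows from rownames(train). That only works if sample_frac() keeps the original row names, which may not hold in every dplyr version; head(test) starting at row 126 would be consistent with the row names having been reset, although this is a guess from the printed output. A minimal sketch of a split that does not depend on row names, using base R and the built-in iris data as a stand-in (the seed, train_idx, train_demo, and test_demo are illustrative choices, not part of the original example):

```r
# Arbitrary seed, set only so that the same split can be drawn again.
set.seed(42)

# iris is a stand-in for wine or the poster's table (assumption for illustration).
n <- nrow(iris)

# Draw the training row positions explicitly and keep them.
train_idx <- sample(seq_len(n), size = round(0.7 * n))

# Index by position: the test set is everything not drawn for training.
train_demo <- iris[train_idx, ]
test_demo  <- iris[-train_idx, ]
```

Because the positions are stored in train_idx, the two sets cannot overlap, and rerunning with the same seed gives the same split.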

Now, I didn't spend a lot of time on this, but I did notice that the statements

Accuracy in training dataset is 68.8%

and

The accuracy of the test dataset turns out to be 18.4% less as compared to training dataset

are inconsistent with the results of the code.

That raises a meta-question: Is this the right example?

pinkfuchs · November 6, 2019, 12:19pm

Hey technocrat, thank you for your answer. I just tried to create a new topic with the right categories, but the site won't let me create a new one or delete this one during my first days here.
I think there's a little misunderstanding: I just used the instructions in the link as guidance and applied them to my own data. I copied what I wrote in R, but I don't know how to upload my whole data set. If it helps, I can send it by email or something. Or is there an upload button anywhere?

And with my own data the accuracy is 56.4%, and I don't know if I can still use this data.

technocrat · November 6, 2019, 8:44pm

My suggestion for you was to start by applying the guidelines to reproduce their example. If you can't with their data, it probably won't be possible with your own. When we encourage a reproducible example, called a reprex, it's for the purpose of illustrating the problem and the specific points in the code that may be roadblocks.

Let's go back to the guidelines. I got different results than they did. One was close and the other was way off. If this were my own work, I'd go looking for a better example that I could reproduce. Once I did, I'd check my understanding of how the roles of the dependent and independent variables in my data relate to the example.
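
One thing that may be worth checking when a test accuracy comes out far off: the accuracy lines use sum(diag(ctable)), which pairs table rows and columns by position. After relevel() the training factor and the predictions list the reference level first, while an untouched test factor keeps its default order, so the diagonal of the test table may no longer hold the correct predictions. A minimal sketch with made-up labels (not the wine data or the poster's data) shows the effect and a comparison that does not depend on level order:

```r
# Made-up labels for illustration. "actual" keeps the default level order 1, 2, 3.
actual <- factor(c("1", "2", "3", "1", "2", "3"))

# "predicted" uses the order 3, 1, 2, as predictions would after relevel(ref = "3").
predicted <- factor(c("1", "2", "3", "1", "2", "2"), levels = c("3", "1", "2"))

# Diagonal-based accuracy pairs row i with column i, whatever their labels are.
ctable <- table(actual, predicted)
acc_diag <- sum(diag(ctable)) / sum(ctable)

# Comparing the labels as character strings ignores the level order.
acc_direct <- mean(as.character(actual) == as.character(predicted))

print(round(c(diagonal = acc_diag, direct = acc_direct) * 100, 2))
```

Here five of the six labels agree, yet the diagonal counts only one of them. Giving both factors the same levels before calling table() would be another way to keep the diagonal meaningful.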

In your work, I strongly encourage you to review the basics of logistic regression (which your code doesn't do) for independent variables that can take on more than two values. This is covered in Chapter 5 of Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. Applied Logistic Regression. Hoboken, New Jersey: Wiley, 2013, the standard text on logistic regression.

The analysis that your code is set up to do is a predictive type of machine learning that is well described in @rafalab's free R course textbook in Section 33.7.

pinkfuchs · November 7, 2019, 12:28am

Thanks a lot for the answer! I'll read this chapter tomorrow; hopefully I'll understand it!
You were a big help. Thanks a lot!

system · November 28, 2019, 12:28am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.