Confused regarding output of cv.glmnet predicted values for logistic regression

bigtree · March 26, 2022, 7:17pm

I am using the glmnet package to perform logistic regression on a dataset.
The x.train and x.test data are simple numeric datasets.
y.train and y.test contain the categories "Coffee" and "Tea".
Basically, each prediction needs to be either "Coffee" or "Tea".

My first question: do I need to convert the y datasets to factors? I haven't done so yet.

Secondly, and mainly, this is the problem:

My code is as follows:

lr.fit<-cv.glmnet(x.train, y.train, type.measure="deviance", family = "binomial")
lr.predicted<-predict(lr.fit, s=c("lambda.1se", "lambda.min"), newx=x.test)

However, when I inspect the lr.predicted variable, I see a list of numbers. I am asking because I was expecting predictions like "Coffee", "Tea", "Coffee", "Coffee", "Tea", and so on.

Kindly guide me in the right direction. I am a beginner with R and machine learning, so apologies for the basic question.

mattwarkentin · March 27, 2022, 6:09pm

Hi @bigtree,

Logistic regression models generally make predictions as probabilities or as logits (log-odds, which are a transformed version of probabilities). These are the numbers you are getting back from predict.cv.glmnet(). Note that its default is type = "link", which returns logits; passing type = "response" returns probabilities instead. You can convert probabilities to classes by thresholding your predictions at some value.
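To make the link between logits and probabilities concrete, here is a minimal base-R sketch. The logit values are made up for illustration and are not taken from your model; plogis() is base R's inverse-logit function.

```r
# Hypothetical link-scale (logit) values, similar in kind to what
# predict() returns for a binomial glmnet fit when type = "link"
logits <- c(-2.0, -0.3, 0, 0.8, 3.1)

# Inverse logit: p = 1 / (1 + exp(-logit))
probs <- plogis(logits)
print(round(probs, 3))

# A logit above 0 is the same thing as a probability above 0.5
print(data.frame(logit = logits, prob = round(probs, 3), above_half = probs > 0.5))
```

So thresholding logits at 0 and thresholding probabilities at 0.5 give the same classes.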

For example, if your Y variable is a factor with levels "Coffee" (the first level, treated as 0) and "Tea" (the second level, treated as 1), then the probabilities refer to "Tea". You can threshold them at 0.5, so that any prediction below 0.5 is classified as "Coffee" and any prediction above 0.5 as "Tea".
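Here is a rough sketch of how that could look with glmnet. The data are simulated purely for illustration (the five numeric columns, the 150/50 split, and the seed are my assumptions, not your dataset), and I use s = "lambda.min" just to pick one of the two lambda values. It also shows type = "class", which asks predict() to return the labels directly.

```r
library(glmnet)

# Simulated stand-in data (hypothetical): 5 numeric predictors, two-class outcome
set.seed(42)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "Tea", "Coffee"),
            levels = c("Coffee", "Tea"))  # "Tea" is the second level

train   <- 1:150
x.train <- x[train, ]
y.train <- y[train]
x.test  <- x[-train, ]
y.test  <- y[-train]

lr.fit <- cv.glmnet(x.train, y.train, family = "binomial",
                    type.measure = "deviance")

# Probabilities of the second factor level ("Tea")
prob.tea <- predict(lr.fit, newx = x.test, s = "lambda.min", type = "response")

# Manual thresholding at 0.5
manual.class <- ifelse(prob.tea[, 1] > 0.5, "Tea", "Coffee")

# Or let glmnet return the class labels directly
glmnet.class <- predict(lr.fit, newx = x.test, s = "lambda.min", type = "class")

# Compare predicted labels with the held-out truth
print(table(predicted = glmnet.class[, 1], actual = y.test))
```

Thresholding by hand is mainly useful if you want a cutoff other than 0.5; otherwise type = "class" is the shorter route.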

Hope this is helpful.
