Logistic Regression model: " glm.fit: fitted probabilities numerically 0 or 1 occurred "

blackish952 · June 16, 2018, 3:19am

I am currently having two issues:

When I build the logistic regression model using glm() package, I have an original warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
One article on stack-overflow said I can use Firth's reduced bias algorithm to fix this warning, but then when I use logistf, the process seems to take too long so I have to terminate it. It may be due to me running a data set of 183,300 rows....

How can I approach this issue?

Thanks!

JohnMount · June 16, 2018, 2:38pm

I would suggest giving glmnet a try- it introduces a regularization that can help a bit and should be performant.

On the issue of 0/1 probabilities: it means your problem has separation or quasi-separation (a subset of the data that is predicted perfectly and may be running a subset of the coefficients out to infinity). That can cause problems, so you will want to look at the coefficients (especially those that are large and have large uncertainty intervals) and at data with probability scores near zero or one (or link-scores with large absolute values).

blackish952 · June 18, 2018, 12:44pm

@JohnMount

> logitmod <- glm(Class ~ cust_prog_level + CUST_REGION_DESCR + Sales , data=down_train, family = "binomial")
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> logitmod <- glmnet(Class ~ cust_prog_level + CUST_REGION_DESCR + Sales , data=down_train, family = "binomial")
Error in glmnet(Class ~ cust_prog_level + CUST_REGION_DESCR + Sales, data = down_train,  : 
  unused argument (data = down_train)

please help.

jcblum · June 18, 2018, 12:59pm

glmnet is not a drop-in replacement for stats::glm — it has its own argument structure (in particular, it does not take a data argument, as the error above indicates). You should make sure you’ve read the documentation, and if that doesn’t totally make sense, you’ll probably want to consult the main vignette included with the glmnet package.

blackish952 · June 18, 2018, 1:11pm

@jcblum
Thanks so much!

Do you happen to know where I can find a good intro resource to Time Series Analysis and Forecasting in R?

Thanks again!

jcblum · June 18, 2018, 1:19pm

This is an excellent place to start:

blackish952 · June 18, 2018, 1:26pm

@jcblum
Is it a good book to purchase?
There are tons of books on Amazon and this book is like 56$.
I like to hold a book and read instead of having a pdf file.

What do you think?

mara · June 18, 2018, 1:33pm

If you're going to be forecasting in R, Rob J Hyndman is two-thumbs-up the way to go! I'm the same way about books, but you can kind of split the difference and get the earlier edition for ~$35 bucks (full disclosure, not sure how much has changed). You can also check out Hyndman's blog posts, notes, etc. most of which are linked through his amazon page:

blackish952 · June 18, 2018, 1:34pm

@mara
Thanks Mara.
I appreciate your input.

mara · June 18, 2018, 1:35pm

Sure thing. I forgot, he also has a course on datacamp, if you're in the mood for something interactive:

blackish952 · June 18, 2018, 1:41pm

@mara
I am also taking a course on coursera.
I think datacamp is sorta new (I could be wrong) and I am not sure how qualified the programs but I have been reading tutorials on Datacamp since last December 2017. They are useful but sorta "quick to the point for END-USERS".
To me, I am not going to learn as end-users like most institutions train students: here is the lm() package, plug in the data set, do summary(), submit the report. Score!
I hate that. You know what I mean?
I want to become a true future data scientist who knows: Statistics, Applied Mathematics and Computer Science (ML, etc.) which happen to be my three majors for now.
Just a brief intro about myself.

Thanks @ Mara and @jcblum

jcblum · June 18, 2018, 2:32pm

Although I’m pretty much the opposite these days (at least when it comes to this kind of book), I understand how a physical book can work better for people. I think this one is very much worth the money — and fwiw it’s much less expensive than the typical textbook these days.

Buying the now-cheaper first edition is definitely an option, but you should know that there are substantial differences between the first and second editions (described right up front in the 2nd edition preface — scroll down) so you might wind up going back and forth between the print 1st ed and the online 2nd ed to some extent.

The online version is the canonical one — it’s updated first and most frequently. The goal is for the book to be as freely and widely available as possible, but in the end making physical things costs money (especially when you’re not printing hundreds or thousands of titles — like many other profs, Hyndman grew disenchanted with the big textbook publishing racket, so he publishes this book himself).

blackish952 · June 18, 2018, 2:33pm

@jcblum
Thank you for your response! <3

blackish952 · June 19, 2018, 1:21am

@JohnMount
Hello John,
Question 1:

> glm_model <- glm(down_train$Class~.,data = down_train, family = binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(glm_model)

Call:
glm(formula = down_train$Class ~ ., family = binomial, data = down_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1399  -1.0367   0.2300   0.9208   8.4904  

Coefficients:
                                         Estimate  Std. Error  z value             Pr(>|z|)    
(Intercept)                           -11.4108728  23.4467343   -0.487              0.62649    
cust_prog_levelC                        1.1169289   0.1379501    8.097 0.000000000000000565 ***
cust_prog_levelD                        0.9901193   0.1298182    7.627 0.000000000000024034 ***
cust_prog_levelE                        1.1544305   0.1338535    8.625 < 0.0000000000000002 ***
cust_prog_levelG                        0.9318156   0.1293693    7.203 0.000000000000590075 ***
cust_prog_levelI                        1.2254934   0.1324209    9.255 < 0.0000000000000002 ***
cust_prog_levelL                        1.2905412   0.1296241    9.956 < 0.0000000000000002 ***
cust_prog_levelM                        0.9916104   0.3064315    3.236              0.00121 ** 
cust_prog_levelN                        1.2756395   0.1293191    9.864 < 0.0000000000000002 ***
cust_prog_levelP                        0.9498830   0.1293275    7.345 0.000000000000206089 ***
cust_prog_levelR                        0.7728894   0.2732953    2.828              0.00468 ** 
cust_prog_levelS                        0.9850813   0.1294868    7.608 0.000000000000027928 ***
cust_prog_levelX                        1.7231817   0.3874403    4.448 0.000008683257380494 ***
cust_prog_levelZ                        2.2222995   0.1342246   16.557 < 0.0000000000000002 ***
CUST_REGION_DESCRMOUNTAIN WEST REGION  11.0671194  23.4463814    0.472              0.63691    
CUST_REGION_DESCRNORTH CENTRAL REGION  11.2638726  23.4463804    0.480              0.63094    
CUST_REGION_DESCRNORTH EAST REGION     11.3171832  23.4463806    0.483              0.62932    
CUST_REGION_DESCROHIO VALLEY REGION    11.3389408  23.4463811    0.484              0.62866    
CUST_REGION_DESCRSOUTH CENTRAL REGION  11.2946499  23.4463802    0.482              0.63000    
CUST_REGION_DESCRSOUTH EAST REGION     11.3990264  23.4463799    0.486              0.62684    
CUST_REGION_DESCRWESTERN REGION        11.1147124  23.4463816    0.474              0.63547    
Sales                                  -0.0107124   0.0000521 -205.598 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 522966  on 377239  degrees of freedom
Residual deviance: 431903  on 377218  degrees of freedom
AIC: 431947

Number of Fisher Scoring iterations: 9

So the summary() give me this result and there are quite a few estimates with standard errors exceed 23. So can I assume the warning message "glm.fit: fitted probabilities numerically 0 or 1 occurred" explains for this large number? I am asking this question because the other standard errors are small and those mentioned above are so so large, so I am curious why!

Question 2: I took your advice and used glmnet() method

>xfactors<-model.matrix(Class ~ CUST_REGION_DESCR + cust_prog_level,data=down_train)[,-1]
>x = as.matrix(data.frame(down_train$Sales,xfactors))
>glmmod = glmnet(x,y=as.factor(down_train$Class),alpha=1,family='binomial')
> summary(glmmod)
           Length Class     Mode     
a0           60   -none-    numeric  
beta       1260   dgCMatrix S4       
df           60   -none-    numeric  
dim           2   -none-    numeric  
lambda       60   -none-    numeric  
dev.ratio    60   -none-    numeric  
nulldev       1   -none-    numeric  
npasses       1   -none-    numeric  
jerr          1   -none-    numeric  
offset        1   -none-    logical  
classnames    2   -none-    character
call          5   -none-    call     
nobs          1   -none-    numeric

What does the result tell me?
One article says that:
if you take logarithm on both side of the below equation,
08%20PM
you will get:
17%20PM
The quantity on the left is the logarithm of the odds. So, the model is a linear regression of the log-odds, sometimes called logit, and hence the name logistic.

Since I have more than one regressor(Sales, cust_program_level, CUST_REGION_DESCR), I assume I will have beta_0, beta_1, beta_2 or something like that.

Should the explanation for summary(glmmod) be focused on whether we have some estimates for those betas just as summary() for a linear regression model?

Thank you!

JohnMount · June 19, 2018, 6:50pm

The large std.errors. in this case can be a symptom of the quasi-spearation. The coefficients may be trying to run to infinity based on some levels of CUST_REGION_DESCR. I would look at the data with something like table(down_train$Class, down_train$CUST_REGION_DESCR).
Unfortunately glmnet is not a drop in replacement for glm. glmnet builds a series of models, one for each different value of a regularization parameter called s. To see the model use coef() and supply an s, as in: coef(glmmod, s = 0.001). You won't get error bars and t-values as the regularization term interferes with that interpretation. Picking the s hyperparmeter can be an art, and is usually done with cross-validation, which the function cv.glmnet() supplies.

rahmed · June 25, 2018, 12:29am

This is not a Rstudio problem but a modelling issue. As @JohnMount indicated you have so called PERFECT PREDICTION problem. That means for one(/more) predictor(s) you can determine threshold which can predict the outcome exactly. With an handful of predictors you can always use table (for categorical predictors) or boxplot (plot continuous predictors against outcome) to find the offending predictors. You should not use those predictors in your model.

JohnMount · June 25, 2018, 1:02am

One subtle point- the perfect predictors cause some issues with the logistic model, but they are also often among your more important predictors. So one doesn't want to fully throw them out, but use them outside the model (such as in an initial decision tree).

rahmed · June 25, 2018, 1:20am

I do agree with your thoughts specially when doing a ATTRIBUTION type modelling exercise. In my experience it's a bad idea when people ignore that warning message as it often causes information leakage in PREDICTIVE modelling exercise. Later it turns out that such perfect predictors/feature are not available for new data for which predictions are to be made. This might cause a serious difference in model's performance specially for those models which doesn't stop because of missing value.

blackish952 · June 25, 2018, 1:02pm

Can you please show me how to use the rms:lrm() in Rstudio to accomplish what you described above?
Thanks!

rahmed · June 26, 2018, 1:05am

That is my mistake. rms::lrm() gives multicollinear predictor not the perfect predictor automatically. I already edited my response and took that part out. Manual inspection using tables/plots will do the job.