Problem with releveling categorical variable in zero-inflated negative binomial model

tmauch · April 22, 2020, 3:52pm

Hello everyone,

I have a sample of items that I've tracked to see how many times they were mentioned in the mass news media, so my dependent variable is the count MenCount. The main independent variable I want to track, OAStatus, is categorical with three unordered levels: gold, green, and pink. I also have at least one continuous independent variable I want to include, JJIF. I have about 626,000 records, 85 percent of which have no news mention (MenCount = 0), but plenty of items in each color level do have news mentions. The median count is 3, but I have some pretty big outliers with counts of 3,000+, so my data appear to be heavily overdispersed. Because of the large number of zeroes and the overdispersion, it seems I should use a zero-inflated negative binomial model.

When I first started running this in R, using different mixes of variables, I had no problem with my two main independent variables, OAStatus and JJIF. However, after playing around with different mixes, when I went back to just these two, I started getting warnings and NAs in my results. I've narrowed the problem down to releveling OAStatus. If I let R order the levels by default (alphabetically), gold is the reference level in the intercept, and I get no NAs or warnings. However, I need pink to be the reference level, and when I relevel and run the code, I get this:

Call:
zeroinfl(formula = MenCount ~ OAStatus | OAStatus, data = AllItems_TotalCount, 
    dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
 -0.2530  -0.2474  -0.1850  -0.1850 691.1543 

Count model coefficients (negbin with log link):
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.118191   0.010052   11.76   <2e-16 ***
OAStatusGold   0.293901   0.014946   19.66   <2e-16 ***
OAStatusGreen  1.613422   0.017901   90.13   <2e-16 ***
Log(theta)    -2.737584   0.005584 -490.27   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -12.4955         NA      NA       NA
OAStatusGold   12.2293         NA      NA       NA
OAStatusGreen  -0.6785     3.8898  -0.174    0.862
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.0647 
Number of iterations in BFGS optimization: 24 
Log-likelihood: -5.035e+05 on 7 Df
Warning message:
In sqrt(diag(object$vcov)) : NaNs produced

I'm only using the one independent variable, so I don't see how collinearity would be a problem. I've also tried to relevel so that green was the intercept and got even more NAs. Same if I try adding in my other independent variable, JJIF. Any ideas why the model would work only if OAStatus was leveled one certain way but not any other? And why would it have worked previously but no longer?

I should also note that I've tried several ways of releveling:

AllItems_TotalCount$OAStatus <- relevel(AllItems_TotalCount$OAStatus, ref = "Green")

and

AllItems_TotalCount$OAStatus <- factor(AllItems_TotalCount$OAStatus, levels = c("Paywalled", "Gold", "Green"))

I even tried to cheat and just change the value "pink" to "blue" so it would automatically come first, but I still had the problem of the NAs.
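
For reference, the two releveling approaches above can be checked on a small toy factor before refitting. This is only a sketch: the vector `status` and its values are made up for illustration, and only the level names are taken from the calls above.

```r
# Toy factor standing in for AllItems_TotalCount$OAStatus (values are made up)
status <- factor(c("Gold", "Green", "Paywalled", "Gold"))
levels(status)  # default order is alphabetical, so "Gold" comes first

# relevel() moves the named level to the front and keeps the rest in order
status_a <- relevel(status, ref = "Paywalled")
levels(status_a)

# factor(..., levels = ) sets the complete order explicitly
status_b <- factor(status, levels = c("Paywalled", "Gold", "Green"))
levels(status_b)

# Both approaches should give the same level order and the same data
identical(levels(status_a), levels(status_b))
identical(as.character(status_a), as.character(status))
```

Running `levels()` on the real column right after releveling is a quick way to confirm which level the model will treat as the reference.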

Max · April 23, 2020, 3:30am

Can you do a cross-tabulation?

table(AllItems_TotalCount$OAStatus, AllItems_TotalCount$MenCount == 0)

tmauch · April 23, 2020, 5:32pm

Sure.

table(AllItems_TotalCount$OAStatus, AllItems_TotalCount$MenCount == 0)
           
             FALSE   TRUE
  Gold       41593 352024
  Green      28560  42667
  Paywalled  17688 144016
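
The share of zero counts in each group can be read off this table. A minimal sketch that recomputes it, with the counts typed in by hand from the output above (the object names `tab` and `zero_share` are just for illustration):

```r
# Counts copied from the cross-tabulation above:
# column "FALSE" = MenCount > 0, column "TRUE" = MenCount == 0
tab <- matrix(c(41593, 352024,
                28560,  42667,
                17688, 144016),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("Gold", "Green", "Paywalled"),
                              c("FALSE", "TRUE")))

# Proportion of records with zero mentions within each OAStatus level
zero_share <- tab[, "TRUE"] / rowSums(tab)
round(zero_share, 3)

# Total number of records and overall share of zeros
sum(tab)
sum(tab[, "TRUE"]) / sum(tab)
```

The same proportions should come straight from the data with `prop.table(table(...), margin = 1)`.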
