Help with multivariate probit model

Hi community,
To analyse my data, a multivariate probit model is suggested in the literature. I have six dependent variables and up to 7 independent variables. I coded the data in Excel and then read it into R for the analysis. At first I got no output at all, but in the meantime I have managed to get something. Unfortunately, the following warning appears at the end and the output is not interpretable:

When I enter warnings(), the same message comes up many times over, namely:

The correlation matrix is not positive definite

I don't quite understand it, because I did it the same way as in the paper "What hampers innovation? Revealed barriers versus deterring barriers" by D'Este et al. (2012).
ChatGPT couldn't help me either.

This is the code I use for the model:

library(mvProbit)

model.MVP2 <- mvProbit(
  cbind(Cost.B, Knowledge.B, Market.B, Regulation.B, Data.B, Trust.B) ~
    Little + Average + High + Very_High + LN_Employees + Higher_Education + RD_Invest,
  data = data
)
summary(model.MVP2)

All variables are binary coded except for LN_Employees, Higher_Education and RD_Invest, but this should not be a problem.

Can someone help me? I'd be very thankful.

An educated guess might be that you have too few samples relative to the number of parameters.
This would imply one of

  1. getting more samples, or
  2. dropping some independent variables from your model to simplify it.

Below is a reproducible illustration using iris:

# start common example setup

library(mvProbit)
library(tidyverse)

set.seed(123)
# one-hot encode Species into three binary outcome columns
iris_1hot_full <- mutate(iris,
       setosa     = 1 * (Species == "setosa"),
       versicolor = 1 * (Species == "versicolor"),
       virginica  = 1 * (Species == "virginica")
) |> select(-Species)

# end common example setup

# example of the problem: only 20 rows
iris_1hot <- slice_sample(iris_1hot_full, n = 20)

mvProbit(iris_1hot |> select(setosa, versicolor, virginica) |> as.matrix() ~
           Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
         data = iris_1hot, nGHK = 50)
# Error in maxNRCompute(fn = function (theta, fnOrig, gradOrig = NULL, hessOrig = NULL,  : 
#                                        NA in gradient

# 1) more samples
iris_1hot2 <- slice_sample(iris_1hot_full, n = 40)

mvProbit(iris_1hot2 |> select(setosa, versicolor, virginica) |> as.matrix() ~
           Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
         data = iris_1hot2, nGHK = 50)

# 2) same small sample as before, but a simpler model (here one fewer
#    dependent variable, which also reduces the parameter count)
mvProbit(iris_1hot |> select(versicolor, virginica) |> as.matrix() ~
           Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
         data = iris_1hot, nGHK = 50)

Assuming you do have a large sample, I wonder if including Little + Average + High + Very_High is creating a dummy-variable trap? (Although usually R is good at handling this somewhat gracefully.)
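
For illustration, a minimal sketch of the trap (toy data, made-up names): with an intercept plus a full set of dummies the design matrix is rank-deficient. glm() handles this gracefully by returning NA for the aliased column, whereas the optimiser behind mvProbit() may simply fail.

set.seed(1)
d <- data.frame(y = rbinom(30, 1, 0.5),
                g = factor(sample(c("a", "b", "c"), 30, replace = TRUE)))
X <- model.matrix(~ g - 1, d)   # all three dummies, none dropped
coef(glm(y ~ X, data = d, family = binomial(link = "probit")))
# one coefficient comes back NA: glm's graceful handling of the trap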


The signature for mvProbit() is

mvProbit( formula, data, start = NULL, startSigma = NULL, 
   method = "BHHH", finalHessian = "BHHH", 
   algorithm = "GHK", nGHK = 1000, 
   intGrad = TRUE, oneSidedGrad = FALSE, eps = 1e-6, 
   random.seed = 123, ... )

but the error indicates that your model has two of the defaults overridden with NULL. Try omitting those arguments.


Thanks a lot for your contribution.

In fact, I have only 86 observations. Is this too few?

I have estimated the six univariate probits separately, and in some models the standard errors are extremely high, for example:

Call:
glm(formula = Markt.B ~ Wenig + Durchschnittlich + Hoch + Sehr_hoch +
LN_Mitarbeitende + LN_Hochschulabschluss + LN_RDI, family = binomial(link = "probit"))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)            -0.63159    0.53591  -1.179   0.2386
Wenig                   0.19699    0.51478   0.383   0.7020
Durchschnittlich       -0.13406    0.42270  -0.317   0.7511
Hoch                   -5.17068  414.80216  -0.012   0.9901
Sehr_hoch              -4.50883  865.48937  -0.005   0.9958
LN_Mitarbeitende        0.09653    0.09574   1.008   0.3133
LN_Hochschulabschluss   0.01614    0.16146   0.100   0.9204
LN_RDI                 -0.36327    0.20095  -1.808   0.0706 .

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 85.512  on 85  degrees of freedom

Residual deviance: 71.749 on 78 degrees of freedom
(5 observations deleted due to missingness)
AIC: 87.749

Number of Fisher Scoring iterations: 17

Do you have any other suggestion for me?

In principle, 86 is enough, although a sample that small might explain the large standard errors. Can you check whether
Little + Average + High + Very_High == 1?
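
A quick way to run that check (a sketch, assuming the dummies are columns of your data frame data):

table(rowSums(data[, c("Little", "Average", "High", "Very_High")]))
# if every row sums to 1, the four dummies plus the intercept are
# perfectly collinear (the dummy-variable trap); rows summing to 0
# would be an omitted baseline category, which is fine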

They are not. Why should they be? This is an ordinally scaled variable (innovation activity from 1-5), which I turned into dummies.

No, they shouldn't. Since you have 4 dummies for 5 categories you are not in the dummy-variable trap. In principle, this should work.

One possibility is that one of the 5 categories is extremely rare and maybe this leads to estimation difficulties. How many fall into each of the five categories?
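
For example (same column names as in your model):

dummies <- data[, c("Little", "Average", "High", "Very_High")]
colSums(dummies)            # counts for the four dummy categories
sum(rowSums(dummies) == 0)  # count for the omitted baseline (None)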

None: 37
Little: 12
Average: 27
High: 12
Very high: 3

In principle that should be fine. You might try dropping the 3 Very High observations and see if that magically makes things work.
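
Something along these lines (a sketch with your column names; the model name is just illustrative, and Very_High has to leave the formula too, since it is constant zero in the subset):

data_noVH <- subset(data, Very_High == 0)
model.MVP2b <- mvProbit(
  cbind(Cost.B, Knowledge.B, Market.B, Regulation.B, Data.B, Trust.B) ~
    Little + Average + High + LN_Employees + Higher_Education + RD_Invest,
  data = data_noVH
)
summary(model.MVP2b)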

Thanks. It worked but....

b_1_0? b_1_1? What are these the levels of? They probably shouldn't be in there.

I think those are there because there are 6 dependent variables: the coefficients are labelled per equation, so b_1_0 would be the intercept of the first equation, b_1_1 its first slope, and so on.

Ah, I see, thank you startz

@Rstarterpack I don't see that you're doing anything wrong. Perhaps fold the Very High group into the High group and then do the individual probits. Then maybe use what you find to get starting values for the multivariate version.
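
A sketch of that idea (your column names; High2 is a made-up name for the merged category):

data$High2 <- pmax(data$High, data$Very_High)  # fold Very High into High
deps <- c("Cost.B", "Knowledge.B", "Market.B", "Regulation.B", "Data.B", "Trust.B")
rhs  <- c("Little", "Average", "High2", "LN_Employees",
          "Higher_Education", "RD_Invest")
# one univariate probit per outcome; stack the coefficient vectors
start_beta <- unlist(lapply(deps, function(y)
  coef(glm(reformulate(rhs, response = y),
           family = binomial(link = "probit"), data = data))))
# start_beta can then be passed to mvProbit() via its start argument
# (and startSigma for the error correlations, if you have a guess)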
