replicating Stata's xtmixed with lmer in R

mburzillo · April 2, 2020, 5:03pm

Hi all. I am trying to replicate a Stata paper in R and am having an issue where the results of a multi-level mixed-effects linear regression are very similar, but slightly off (especially the standard errors). I am wondering if anyone knows why this is happening. I am using the exact same dataset in both specifications (which can be found here). The Stata code I am trying to replicate is:

use "/Users/mariaburzillo/Desktop/GOV1006/final_project/racial_polarization_winners.dta", clear

xtmixed biggestsplit H_citytract_multi_i diversityinterp pctasianpopinterp pctblkpopinterp pctlatinopopinterp medincinterp pctrentersinterp pctcollegegradinterp biracial nonpartisan primary logpop  i. year south midwest west if winner==1||geo_id2:

Which gives the following output:

The reprex for my attempt to replicate this in R is:

library(haven)
library(tidyverse)
library(lme4)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
library(reprex)

# load data

rp <- read_dta("/Users/mariaburzillo/Desktop/GOV1006/final_project/racial_polarization_winners.dta")


# create factor variable for year

rp$year.f <- as.factor(rp$year)

# apply condition that winner == 1 as in Stata regression

rp_sub <- rp %>%
  filter(winner == 1)

# multi-level mixed effects regression with random effects for geo_id2

m1 <- lmer(biggestsplit ~ H_citytract_multi_i + diversityinterp + pctasianpopinterp + pctblkpopinterp + pctlatinopopinterp + medincinterp + pctrentersinterp +  pctcollegegradinterp + biracial + nonpartisan + primary + logpop + year.f + south + midwest + west + (1 | geo_id2), data = rp_sub)
#> Warning: Some predictor variables are on very different scales: consider
#> rescaling
#> boundary (singular) fit: see ?isSingular

# summarize

summary(m1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: 
#> biggestsplit ~ H_citytract_multi_i + diversityinterp + pctasianpopinterp +  
#>     pctblkpopinterp + pctlatinopopinterp + medincinterp + pctrentersinterp +  
#>     pctcollegegradinterp + biracial + nonpartisan + primary +  
#>     logpop + year.f + south + midwest + west + (1 | geo_id2)
#>    Data: rp_sub
#> 
#> REML criterion at convergence: -5.2
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.0918 -0.4281  0.0000  0.5283  1.9850 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  geo_id2  (Intercept) 0.00000  0.000   
#>  Residual             0.02312  0.152   
#> Number of obs: 91, groups:  geo_id2, 25
#> 
#> Fixed effects:
#>                        Estimate Std. Error t value
#> (Intercept)          -2.419e-01  7.126e-01  -0.339
#> H_citytract_multi_i   9.320e-01  4.937e-01   1.888
#> diversityinterp       3.845e-01  4.530e-01   0.849
#> pctasianpopinterp    -1.149e-01  6.605e-01  -0.174
#> pctblkpopinterp      -4.318e-01  3.373e-01  -1.280
#> pctlatinopopinterp   -1.905e-01  3.213e-01  -0.593
#> medincinterp         -4.225e-06  8.304e-06  -0.509
#> pctrentersinterp     -5.796e-01  5.282e-01  -1.097
#> pctcollegegradinterp  3.277e-01  8.903e-01   0.368
#> biracial              2.104e-01  4.647e-02   4.527
#> nonpartisan          -8.956e-02  8.312e-02  -1.077
#> primary              -9.161e-02  3.998e-02  -2.291
#> logpop                3.466e-02  6.904e-02   0.502
#> year.f1991            7.068e-02  7.788e-02   0.908
#> year.f1993            6.612e-02  1.159e-01   0.570
#> year.f1994            1.153e-01  2.075e-01   0.556
#> year.f1995            4.397e-02  9.424e-02   0.467
#> year.f1996            1.255e-01  1.846e-01   0.680
#> year.f1997            5.762e-02  1.097e-01   0.526
#> year.f1998            2.933e-02  1.963e-01   0.149
#> year.f1999           -2.390e-02  1.027e-01  -0.233
#> year.f2000            2.958e-01  1.871e-01   1.581
#> year.f2001            8.088e-02  1.192e-01   0.679
#> year.f2002            6.029e-02  1.964e-01   0.307
#> year.f2003           -3.516e-02  1.233e-01  -0.285
#> year.f2004           -6.759e-02  1.459e-01  -0.463
#> year.f2005            1.175e-02  1.399e-01   0.084
#> year.f2006            5.294e-02  2.908e-01   0.182
#> year.f2007           -9.040e-02  1.831e-01  -0.494
#> year.f2009           -2.597e-02  2.317e-01  -0.112
#> south                 2.331e-01  1.194e-01   1.953
#> midwest               1.534e-01  1.328e-01   1.156
#> west                  8.498e-02  1.148e-01   0.740
#> 
#> Correlation matrix not shown by default, as p = 33 > 12.
#> Use print(x, correlation=TRUE)  or
#>     vcov(x)        if you need it
#> fit warnings:
#> Some predictor variables are on very different scales: consider rescaling
#> convergence code: 0
#> boundary (singular) fit: see ?isSingular

In response to a request for an example from a built-in dataset, etc., here is an alternative reprex using the EmplUK dataset from the plm package:

library(tidyverse)
library(lme4)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
library(reprex)
library(plm)
#> 
#> Attaching package: 'plm'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, lag, lead

# load data from plm package

data("EmplUK", package="plm")

# multi-level mixed effects regression with random effects for firm


m4 <- lmer(wage ~ as.factor(year) + sector + emp + capital + 
             output + (1 | firm), data = EmplUK)

# summarize

summary(m4)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: wage ~ as.factor(year) + sector + emp + capital + output + (1 |  
#>     firm)
#>    Data: EmplUK
#> 
#> REML criterion at convergence: 4841.1
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -6.5171 -0.5067 -0.0280  0.4502  7.9279 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  firm     (Intercept) 21.489   4.636   
#>  Residual              3.793   1.948   
#> Number of obs: 1031, groups:  firm, 140
#> 
#> Fixed effects:
#>                     Estimate Std. Error t value
#> (Intercept)         26.57338    1.55552  17.083
#> as.factor(year)1977 -1.99478    0.27950  -7.137
#> as.factor(year)1978 -2.63032    0.27979  -9.401
#> as.factor(year)1979 -2.51523    0.27983  -8.988
#> as.factor(year)1980 -2.33948    0.28846  -8.110
#> as.factor(year)1981 -1.37792    0.32335  -4.261
#> as.factor(year)1982 -0.33180    0.33480  -0.991
#> as.factor(year)1983  0.49106    0.37275   1.317
#> as.factor(year)1984  0.33148    0.43801   0.757
#> sector              -0.78632    0.14900  -5.277
#> emp                 -0.09098    0.02393  -3.801
#> capital              0.17433    0.05542   3.145
#> output               0.03097    0.01132   2.736
#> 
#> Correlation matrix not shown by default, as p = 13 > 12.
#> Use print(x, correlation=TRUE)  or
#>     vcov(x)        if you need it


# write to csv for use in stata

write.csv(EmplUK, file = "empUK.csv" )

The corresponding stata command would be:

xtmixed wage i.year sector emp capital output || firm:

With this example, the outputs are again similar, but they are slightly off. I am trying to see if I can exactly reproduce the Stata code in R.

Thank you for the help! Happy to answer any more questions.

technocrat · April 2, 2020, 8:43pm

Here's a stern reply with kind-hearted intentions:

Screenshots of a reprex are not the thing itself. The whole point of a reprex is that it can be generated, then cut-and-pasted into the dialog box for the question, cut-and-pasted out and then pasted into an R session to see what's happening.

I'm not sure if I can help with this or not, but I'm pretty sure that no one is going to if they have to print out the image and type in the code.

mburzillo · April 2, 2020, 9:21pm

Hi. Thank you for the feedback, I don't know why I did that! I just edited the post to include the actual reprex. Thank you.

technocrat · April 2, 2020, 10:43pm

Thanks! Can you share enough of rp to run the code or a built-in data set that will throw the same error?

mburzillo · April 3, 2020, 12:41am

Hi again, I posted the link to the data at the top of the post (or here). I am not sure what the ettiquette is, so if all data must be within the reprex/built in, I just added a similar example with built-in data below my main post. Thanks!

technocrat · April 3, 2020, 12:48am

I used the tab delimited file and received the following

suppressPackageStartupMessages(library(lme4)) 
suppressPackageStartupMessages(library(plm)) 

# load data from plm package

data("EmplUK", package="plm")

# multi-level mixed effects regression with random effects for firm

m4 <- lmer(wage ~ as.factor(year) + sector + emp + capital + 
             output + (1 | firm), data = EmplUK)

# summarize

summary(m4)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: wage ~ as.factor(year) + sector + emp + capital + output + (1 |  
#>     firm)
#>    Data: EmplUK
#> 
#> REML criterion at convergence: 4841.1
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -6.5171 -0.5067 -0.0280  0.4502  7.9279 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  firm     (Intercept) 21.489   4.636   
#>  Residual              3.793   1.948   
#> Number of obs: 1031, groups:  firm, 140
#> 
#> Fixed effects:
#>                     Estimate Std. Error t value
#> (Intercept)         26.57338    1.55552  17.083
#> as.factor(year)1977 -1.99478    0.27950  -7.137
#> as.factor(year)1978 -2.63032    0.27979  -9.401
#> as.factor(year)1979 -2.51523    0.27983  -8.988
#> as.factor(year)1980 -2.33948    0.28846  -8.110
#> as.factor(year)1981 -1.37792    0.32335  -4.261
#> as.factor(year)1982 -0.33180    0.33480  -0.991
#> as.factor(year)1983  0.49106    0.37275   1.317
#> as.factor(year)1984  0.33148    0.43801   0.757
#> sector              -0.78632    0.14900  -5.277
#> emp                 -0.09098    0.02393  -3.801
#> capital              0.17433    0.05542   3.145
#> output               0.03097    0.01132   2.736
#> 
#> Correlation matrix not shown by default, as p = 13 > 12.
#> Use print(x, correlation=TRUE)  or
#>     vcov(x)        if you need it

^{Created on 2020-04-02 by the reprex package (v0.3.0)}

This doesn't seem at all comparable to the Stata screen shots.

Pretty much the same result when calling lmer with REML = FALSE

suppressPackageStartupMessages(library(lme4)) 
suppressPackageStartupMessages(library(plm)) 

# load data from plm package

data("EmplUK", package="plm")

# multi-level mixed effects regression with random effects for firm

m4 <- lmer(wage ~ as.factor(year) + sector + emp + capital + 
             output + (1 | firm), data = EmplUK, REML = FALSE)

# summarize

summary(m4)
#> Linear mixed model fit by maximum likelihood  ['lmerMod']
#> Formula: wage ~ as.factor(year) + sector + emp + capital + output + (1 |  
#>     firm)
#>    Data: EmplUK
#> 
#>      AIC      BIC   logLik deviance df.resid 
#>   4842.4   4916.5  -2406.2   4812.4     1016 
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -6.5547 -0.5098 -0.0281  0.4528  7.9738 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  firm     (Intercept) 21.054   4.588   
#>  Residual              3.749   1.936   
#> Number of obs: 1031, groups:  firm, 140
#> 
#> Fixed effects:
#>                     Estimate Std. Error t value
#> (Intercept)         26.57625    1.54452  17.207
#> as.factor(year)1977 -1.99492    0.27790  -7.179
#> as.factor(year)1978 -2.63046    0.27818  -9.456
#> as.factor(year)1979 -2.51538    0.27822  -9.041
#> as.factor(year)1980 -2.33981    0.28680  -8.158
#> as.factor(year)1981 -1.37846    0.32148  -4.288
#> as.factor(year)1982 -0.33238    0.33287  -0.999
#> as.factor(year)1983  0.49026    0.37060   1.323
#> as.factor(year)1984  0.33059    0.43548   0.759
#> sector              -0.78633    0.14750  -5.331
#> emp                 -0.09100    0.02376  -3.830
#> capital              0.17440    0.05506   3.168
#> output               0.03094    0.01125   2.750
#> 
#> Correlation matrix not shown by default, as p = 13 > 12.
#> Use print(x, correlation=TRUE)  or
#>     vcov(x)        if you need it

^{Created on 2020-04-02 by the reprex package (v0.3.0)}

All I can suggest is looking at the other arguments to lmer to see if any of them would show the results on a comparable basis.

mburzillo · April 3, 2020, 3:33pm

Thank you for the help!

mburzillo · April 3, 2020, 11:57pm

Update: I tried my original code with

and this solved the problem with my dataset! I am not sure why it didn't solve the problem in the other example I posted, but it may have been another issue within that example. All in all, thank you!!!

technocrat · April 4, 2020, 12:04am

Great. Please mark the solution for the benefit of those to follow.

system · April 11, 2020, 12:04am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.