Comparing panel models with clustered standard errors in R

jillspe · February 28, 2019, 11:17am

Hello,

I am analysing FE, RE and pooled OLS models for panel data (26 cantons, T = 6, N = 156, balanced). All my variables are percentages.

Y = employment rate of canton refugees
x1 = percentage share of jobs in small Businesses
x2 = percentage share of jobs in large Businesses
Controls = % share of foreigners, cantonal GDP as a percentage of national GDP, unemployment rate of natives

I want to adjust my regression models for clustered SE by group (canton = state), because standard errors are understated when serial correlation is present, which makes hypothesis tests unreliable.

Since there is only one observation per canton and year, clustering by year and canton is not possible. Results have been clustered by canton, because it is assumed that these clusters are independent from each other based on the autonomous nature of cantons, due to the federalist nature of the country. Year clusters are assumed to be dependent on each other, due to the nature of lag effects in economic theories/mechanisms.

Do I need to prove that serial correlation is present, or is it okay to assume it, given that observations of the same canton over time are likely to be correlated? If it is necessary to test for serial correlation before clustering SE, which code do I use?

Here is my code:

FE Model: fixedm6 <- plm(Y ~ X + X1 + controls, data=busdata, index=c("canton", "year"), model="within", effect = 'twoways')

FE model with clustered SE:
cfixedm6 <- coeftest(fixedm6, vcov=vcovHC(fixedm6, method = "arellano", type="HC3",cluster="group"))

Pooled OLS Model:
m6pool <- plm(Y ~ X + X1 + X2, data=busdata, index=c("canton", "year"), model="pooling")

Pooled OLS with clustered SE:
cm6pool <- coeftest(m6pool, vcov=vcovHC(m6pool, type="HC3", cluster="group"))

F-test without Clustered SE:
pFtest(fixedm6, m6pool)
p-value < 2.2e-16 ----> FE is better fit
When I insert the models with clustered SE:
pFtest(cfixedm6, cm6pool)
Error in UseMethod("pFtest") : no applicable method for 'pFtest' applied to an object of class "coeftest"

The same occurs with the other plm test functions (phtest for the Hausman test). RE model:

randm6 <- plm(Y ~ X + X1 + X2, index=c("canton", "year"), data=busdata, model="random")

RE model with clustered SE:
crandm6 <- coeftest(randm6, vcov=vcovHC(randm6, method = "white1", type="HC3", cluster="group"))
Without clustered SE:
phtest(fixedm6, randm6)
p-value indicates FE is the better fit
With clustered SE:
phtest(cfixedm6, crandm6)
Error in UseMethod("phtest") : no applicable method for 'phtest' applied to an object of class "coeftest"

Do I have to compare the models first without clustered SE, find the best model based on the F-test, Hausman test etc., and then cluster the SE for that model?

Without clustering SE in the models, I can easily use these tests to compare the models. However, this seems wrong, considering that the p-values will be distorted. And as soon as I include clustered SE, I don't know how to code the comparison and determine which model is the best fit.

How do I need to approach this in R? Most online resources discuss FE/RE etc, and then discuss clustered SEs, but never how to compare models that have cluster SEs. What am I doing wrong? Which R code would be best here? Am I using the wrong packages?

best regards, Jill
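
A possible way around the error, sketched on the Grunfeld data that ships with plm rather than on the canton data: coeftest() returns a table of coefficients and test statistics, not a fitted model, so pFtest() and phtest() have nothing to work with. The tests need the plm objects themselves. For a Hausman-type comparison that allows for clustering, phtest() has a regression-based variant (method = "aux") that accepts a vcov function. The formula and the vcovHC settings below are illustrative assumptions, not the poster's specification.

```r
library(plm)
library(lmtest)

# Stand-in panel (10 firms, 20 years); the first two columns are firm and year
data("Grunfeld", package = "plm")
form <- inv ~ value + capital

fe <- plm(form, data = Grunfeld, model = "within")
re <- plm(form, data = Grunfeld, model = "random")

# coeftest() yields a coefficient table, which the model tests cannot use
fe_tab <- coeftest(fe, vcov = vcovHC(fe, method = "arellano", cluster = "group"))
class(fe_tab)

# Classical Hausman test: pass the plm objects, not the coeftest tables
h_classic <- phtest(fe, re)

# Regression-based Hausman test with a cluster-robust covariance matrix
h_robust <- phtest(form, data = Grunfeld, method = "aux",
                   vcov = function(x) vcovHC(x, method = "arellano",
                                             cluster = "group"))
h_classic
h_robust
```

The robust variant takes the formula and data rather than two fitted models, so the clustered coeftest() output is only needed afterwards, for reporting the coefficients of whichever model is kept.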

mara · February 28, 2019, 11:31am

There are a few threads on Cross-Validated that might be of interest:

Including this one which has a couple of R package suggestions:

WRT the code, would it be possible for you to use a sample of your data (or dummy data) to make a small self-contained reproducible example?

FAQ: How to do a minimal reproducible example (reprex) for beginners


It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

jillspe · February 28, 2019, 12:05pm

Hello,

Thanks for the replies and links. I've seen the posts on Stack Exchange before. Yet I'm still stuck on the step-by-step process of implementing clustered SE for panel models and then comparing them.

I'm trying to do the reprex that you suggested. But when I use datapasta to create a smaller sample, my code doesn't work anymore.

I'm so confused as to how I can compare panel models that have clustered SE in R. Would it be completely wrong to compare the models without clustered SE first, determine the best fit based on those tests, and then cluster the SE to get appropriate significance levels?

Thank you so much for the help.
best regards, Jill

alexpghayes · February 28, 2019, 3:36pm

Re (1): You can assume that serial correlation is present and adjust for it and the results will still be correct even if there isn't serial correlation (here I presume you're talking about whether or not to use HAC standard errors). If there isn't serial correlation, you will lose some power, but the results will still be valid.
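
If a formal check is wanted anyway, plm ships tests for serial correlation in panel models. The following is a minimal sketch on the Grunfeld data bundled with plm (a stand-in, not the canton data). It assumes an individual fixed effects model and uses the Arellano-type vcovHC estimator, which allows for heteroskedasticity and serial correlation within each group.

```r
library(plm)
library(lmtest)

data("Grunfeld", package = "plm")  # stand-in panel: 10 firms, 20 years
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

# Breusch-Godfrey/Wooldridge test for serial correlation in the idiosyncratic errors
sc_bg <- pbgtest(fe)

# Wooldridge's test for AR(1) errors in fixed effects models (intended for short panels)
sc_w <- pwartest(fe)

# Coefficient table with standard errors clustered by group
robust_tab <- coeftest(fe, vcov = vcovHC(fe, method = "arellano",
                                         type = "HC1", cluster = "group"))
sc_bg
sc_w
robust_tab
```

Either test can be reported next to the clustered results, but the clustered standard errors do not depend on its outcome.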

I don't know anything about comparing panel models, and typically the econometrics community doesn't come through this website very often.

You might be able to get better help by asking the Declare Design crew, or asking on the Stan forums, although they'll want to do something Bayesian there. Data Methods is another possibility, but again this doesn't look like a perfect fit for them either. I'm putting out some feelers about better places for econometrics / panel data questions and will let you know if I hear back.

Peter_Griffin · February 28, 2019, 4:41pm

I use clustered SE a lot. I'd suggest using the lfe package instead of the plm package:

https://cran.r-project.org/web/packages/lfe/index.html

For comparisons, it depends what your goal is.

jillspe · February 28, 2019, 5:10pm

Yes, I have changed all my models from plm to the lfe package's felm format. However, the lfe package is not as clear as plm. I could not find out how I can test whether the assumed fixed effects are actually significant.

How can I determine that the fixed effects are actually significant? (In the plm package this is pFtest, comparing the two-way model with just time fixed effects or just state fixed effects.)
How can I compare the FE felm model to random effects or pooled OLS? (In the plm package this is phtest for the Hausman test, or pFtest to compare with OLS.) As I read, it is not possible to create a random effects model in the lfe package.
What diagnostics should I use, or how do I approach the diagnostics, to determine that my model is a good fit for the data, besides the assumed theoretical background that the FE model is best for my research question?

thank you so much! I really appreciate the help!
best regards, Jill
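
One option for the first question is to keep a plm fit alongside the felm one purely for diagnostics. The sketch below uses the Grunfeld data bundled with plm as a stand-in and runs pFtest() against a pooled model. Note that this is the classical F test, so it is not robust to clustering; it is shown only as an illustration of the call, under those assumptions.

```r
library(plm)

data("Grunfeld", package = "plm")  # stand-in panel, not the canton data
form <- inv ~ value + capital

pooled <- plm(form, data = Grunfeld, model = "pooling")
fe_ind <- plm(form, data = Grunfeld, model = "within", effect = "individual")
fe_two <- plm(form, data = Grunfeld, model = "within", effect = "twoways")

# F test for individual effects: within model against pooled OLS
f_ind <- pFtest(fe_ind, pooled)

# F test for individual and time effects jointly
f_two <- pFtest(fe_two, pooled)

f_ind
f_two
```

A small p-value suggests that the pooled model omits relevant group (or group and time) effects.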

Peter_Griffin · February 28, 2019, 6:15pm

In my personal experience, FE models are more commonly used than RE, and I have never used RE in any of my studies. FE models just add more dummy variables, and clustering adds some penalty to the significance.

As I said, there are many measures of goodness of fit. I am really not sure in what dimension you want to prove that one model is better than another.

My personal experience is that it makes more sense to improve your theoretical model than to improve your regression models.

jillspe · February 28, 2019, 9:48pm

Thank you for the suggestion! I really appreciate the feedback. It's helped a lot.

When you write "what dimension you want to prove", what do you mean specifically? As in, what is there to prove? (I hope this question isn't too dumb, I'm a beginner.)

For example, what is the meaning of the F statistic in this case and how do I interpret it?
Also, are the R2 and adjusted R2 that lfe puts out correct? They are very different from what I had in the plm version of my regression.

My research question is: what are the determinants of regional employment? I used an FE model because I assume that the states are autonomous, and year fixed effects because I want to account for characteristics common to all states observed during the same time period (culture, monetary effects etc.).

THANK U! So kind!
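
On the R2 question, one likely explanation (an assumption about the two packages' summaries, not something checked on the canton data) is that plm reports the within R2, computed on the demeaned data, while felm reports the R2 of the full model including the fixed effect dummies. The sketch below reproduces that gap with plm and an equivalent dummy-variable lm() fit on the Grunfeld data bundled with plm; the slope coefficients agree while the two R2 values differ.

```r
library(plm)

data("Grunfeld", package = "plm")  # stand-in panel, not the canton data

# Two-way fixed effects via the within transformation
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within", effect = "twoways")

# Same model with explicit firm and year dummies (LSDV)
lsdv <- lm(inv ~ value + capital + factor(firm) + factor(year), data = Grunfeld)

r2_within <- r.squared(fe)            # R2 of the demeaned regression
r2_full   <- summary(lsdv)$r.squared  # R2 including the dummies

# The slope estimates coincide even though the R2 values do not
same_slopes <- isTRUE(all.equal(unname(coef(fe)),
                                unname(coef(lsdv)[c("value", "capital")]),
                                tolerance = 1e-6))
c(within = r2_within, full = r2_full)
same_slopes
```

Neither number is wrong; they answer different questions, so it helps to say which one is being reported.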

Peter_Griffin · February 28, 2019, 10:11pm

I am not sure a year effect is needed here; normally a business cycle takes more than one year. A region effect can account for factors that are not included in your regression model, which makes sense.

Given your research question, I personally think beta and the p-value are more important than goodness of fit.

jillspe · March 1, 2019, 12:15am

Alright perfect. Thank you!

Regarding the year effect - even if I have panel data from 2011-2016?

Perfect, I shall read up on beta. I can get the p-value of the model if I get the F statistic, right?

Peter_Griffin you are really saving me ! Thanks!
