multi-level categorical variable in felm linear regression

I can't answer with confidence simply by reading this explainer.

A reprex (see the FAQ) would be helpful.

The problem that felm() addresses is that an lm() model of the form

lm(y ~ x1+x2+x3 + f1+f2+f3)

where f1, f2, f3 are arbitrary factors and x1, x2, x3 are other covariates, performs satisfactorily when the number of factor levels is small but may not when it is large, because of collinearities between the factors and the other covariates. When a factor has a number of levels on the order of the number of subjects (observations) in a large dataset, for example, neither lm() nor the sparse-matrix approaches in {Matrix} are computationally feasible. By the same token, felm() may not be needed for datasets whose factors have relatively few levels.
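To make the scaling concern concrete, here is a minimal base-R sketch of how the width of the dense design matrix that lm() builds grows with the number of factor levels. The helper name design_cols and the toy data are hypothetical, chosen only for illustration.

```r
# Hypothetical toy setup: one numeric covariate and one factor whose
# number of levels we vary. Not taken from the original question.
set.seed(1)
n <- 2000

design_cols <- function(n_levels) {
  d <- data.frame(
    x = rnorm(n),
    f = factor(sample(n_levels, n, replace = TRUE), levels = seq_len(n_levels))
  )
  # Intercept + x + (n_levels - 1) treatment-contrast dummies
  ncol(model.matrix(~ x + f, data = d))
}

widths <- sapply(c(5, 50, 500), design_cols)
widths
```

Each width is n_levels + 1, so with several factors of thousands of levels each the dense matrix quickly becomes impractical to build, let alone invert.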

The case of a single-factor model, likewise, does not appear to call for felm(), as the factor can be eliminated through the within-groups transformation. It is the case of two or more factors in the presence of non-factor covariates that felm() is intended to address. It does so through "projecting" out the factor with the highest number of levels, coding the others as dummy variables. As can be seen in the reprex below, the effect is to omit the coefficients for the factor (categorical) variables from the model output, leaving only those of the non-factor covariates. The estimates for x1 and x2 are identical to those from lm(). Compared to the full model, the projected model reports only as many coefficients as there are non-factor variables, with correspondingly fewer numerator degrees of freedom in its F-statistic (2 rather than 33 here); the residual degrees of freedom (966) are the same in both.
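As a brief aside before the reprex, the single-factor within-groups transformation mentioned above can be sketched in base R. The data and variable names here (g, x, y, b_within, b_dummy) are made up for illustration; the point is only that demeaning by group gives the same slope as including the factor as dummies.

```r
# Hypothetical toy data: one covariate, one grouping factor
set.seed(1)
n <- 500
g <- factor(sample(10, n, replace = TRUE))
x <- rnorm(n)
y <- 2 * x + rnorm(nlevels(g))[g] + rnorm(n)

# Within-groups transformation: subtract each group's mean
y_w <- y - ave(y, g)
x_w <- x - ave(x, g)

# Slope from the demeaned regression vs. slope from the dummy-variable model
b_within <- coef(lm(y_w ~ x_w - 1))[["x_w"]]
b_dummy  <- coef(lm(y ~ x + g))[["x"]]

all.equal(b_within, b_dummy)
```

Note that the standard errors from the demeaned lm() are not directly comparable, because that fit does not account for the degrees of freedom used up by the group means.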

library(lfe)
#> Loading required package: Matrix
## Simulate data
set.seed(42)
n <- 1e3

d <- data.frame(
  # Covariates
  x1 = rnorm(n),
  x2 = rnorm(n),
  # Individuals and firms
  id = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE)),
  # Noise
  u = rnorm(n)
)

# Effects for individuals and firms
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))

# Left hand side
d$y <- d$x1 + 0.5 * d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u

## Estimate the model and print the results
est <- felm(y ~ x1 + x2 | id + firm, data = d)
summary(est)
#> 
#> Call:
#>    felm(formula = y ~ x1 + x2 | id + firm, data = d) 
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3751 -0.6768  0.0088  0.6883  2.7803 
#> 
#> Coefficients:
#>    Estimate Std. Error t value Pr(>|t|)    
#> x1  1.04326    0.03228   32.32   <2e-16 ***
#> x2  0.49041    0.03254   15.07   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared(full model): 0.7539   Adjusted R-squared: 0.7455 
#> Multiple R-squared(proj model): 0.5696   Adjusted R-squared: 0.5549 
#> F-statistic(full model):89.69 on 33 and 966 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 639.2 on 2 and 966 DF, p-value: < 2.2e-16
# Compare with lm
summary(lm(y ~ x1 + x2 + id + firm - 1, data = d))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2 + id + firm - 1, data = d)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3751 -0.6768  0.0088  0.6883  2.7803 
#> 
#> Coefficients:
#>        Estimate Std. Error t value Pr(>|t|)    
#> x1      1.04326    0.03228  32.319  < 2e-16 ***
#> x2      0.49041    0.03254  15.072  < 2e-16 ***
#> id1     3.74166    0.17650  21.199  < 2e-16 ***
#> id2     0.96200    0.17927   5.366 1.01e-07 ***
#> id3     1.02686    0.20249   5.071 4.74e-07 ***
#> id4     2.13960    0.17190  12.447  < 2e-16 ***
#> id5     1.12131    0.17503   6.406 2.32e-10 ***
#> id6     0.85863    0.18845   4.556 5.87e-06 ***
#> id7     0.85256    0.17839   4.779 2.03e-06 ***
#> id8     1.25744    0.18396   6.835 1.45e-11 ***
#> id9    -0.95332    0.19765  -4.823 1.64e-06 ***
#> id10    0.50332    0.18943   2.657 0.008014 ** 
#> id11    1.29660    0.18697   6.935 7.44e-12 ***
#> id12    2.00367    0.17489  11.457  < 2e-16 ***
#> id13   -0.02849    0.20090  -0.142 0.887257    
#> id14    0.66788    0.18563   3.598 0.000337 ***
#> id15   -0.07461    0.17510  -0.426 0.670153    
#> id16    1.51743    0.17799   8.525  < 2e-16 ***
#> id17    2.10649    0.18372  11.466  < 2e-16 ***
#> id18    1.18966    0.17464   6.812 1.69e-11 ***
#> id19    1.34483    0.18893   7.118 2.13e-12 ***
#> id20   -1.20084    0.18328  -6.552 9.21e-11 ***
#> firm2  -1.50725    0.17093  -8.818  < 2e-16 ***
#> firm3  -1.87472    0.17236 -10.877  < 2e-16 ***
#> firm4  -1.24848    0.16611  -7.516 1.29e-13 ***
#> firm5  -0.74181    0.15959  -4.648 3.81e-06 ***
#> firm6   0.11010    0.16544   0.665 0.505893    
#> firm7  -1.01232    0.16797  -6.027 2.37e-09 ***
#> firm8  -2.48896    0.16741 -14.868  < 2e-16 ***
#> firm9  -1.52025    0.16137  -9.421  < 2e-16 ***
#> firm10 -1.31793    0.15813  -8.334 2.66e-16 ***
#> firm11 -1.14281    0.15977  -7.153 1.68e-12 ***
#> firm12 -0.60866    0.17645  -3.449 0.000586 ***
#> firm13 -1.28568    0.16513  -7.786 1.78e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared:  0.7542, Adjusted R-squared:  0.7455 
#> F-statistic: 87.17 on 34 and 966 DF,  p-value: < 2.2e-16

Created on 2023-05-23 with reprex v2.0.2
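For intuition about what "projecting out" the factors means, here is a rough base-R sketch that residualises the outcome and the covariates on both factors and then regresses residuals on residuals. This is an illustration with plain lm(), not felm()'s internal algorithm, and the helper partial_out is hypothetical. The data are simulated separately from the reprex above, so the numbers will differ from it.

```r
# Toy data with the same structure as the reprex (separate random draws)
set.seed(42)
n <- 1e3
d <- data.frame(
  x1   = rnorm(n),
  x2   = rnorm(n),
  id   = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE))
)
d$y <- d$x1 + 0.5 * d$x2 +
  rnorm(nlevels(d$id))[d$id] + rnorm(nlevels(d$firm))[d$firm] + rnorm(n)

# Hypothetical helper: remove everything explained by the two factors
partial_out <- function(v) resid(lm(v ~ id + firm, data = d))

ry  <- partial_out(d$y)
rx1 <- partial_out(d$x1)
rx2 <- partial_out(d$x2)

# Coefficients from the projected regression and from the full dummy model
b_proj <- coef(lm(ry ~ rx1 + rx2 - 1))
b_full <- coef(lm(y ~ x1 + x2 + id + firm, data = d))[c("x1", "x2")]

all.equal(unname(b_proj), unname(b_full))
```

The two sets of coefficients agree up to numerical tolerance, which is why the x1 and x2 rows match between the felm() and lm() summaries in the reprex.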
