# multi-level categorical variable in felm linear regression

A `reprex` (see the FAQ) would be helpful.

The problem that `felm()` addresses is that an `lm()` model of the form

``````
lm(y ~ x1 + x2 + x3 + f1 + f2 + f3)
``````

where `f1`, `f2`, `f3` are arbitrary factors and `x1`, `x2`, `x3` are other covariates, performs satisfactorily when the number of factor levels is small but may not when it is large, because of collinearities between the factors and the other covariates. With a factor that has one level per subject in a large dataset, for example, neither `lm()` nor the sparse matrix approaches in `{Matrix}` are computationally feasible. By the same token, `felm()` may not be the right tool for datasets whose factors have relatively few levels.
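To make the scaling concern concrete, here is a minimal base-R sketch that counts the columns of the dense design matrix `lm()` would have to build as a single factor gains levels. The data and the level counts (10, 100, 1000) are made up purely for illustration.

```r
# Illustrative data only: one numeric covariate and one factor with k levels
n <- 2000
levels.tried <- c(10, 100, 1000)

width <- sapply(levels.tried, function(k) {
  d <- data.frame(
    x = seq_len(n) / n,
    f = factor(rep_len(seq_len(k), n))  # every one of the k levels occurs
  )
  # Intercept + x + (k - 1) dummy columns = k + 1 columns
  ncol(model.matrix(~ x + f, data = d))
})

data.frame(levels = levels.tried, columns = width)
```

Each additional level adds one dense column of length `n`, so the matrix grows with both the number of observations and the number of levels.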

The single-factor case, likewise, does not appear to call for `felm()`, since the factor can be eliminated with the within-groups transformation. What `felm()` is intended to address is two or more factors alongside non-factor covariates. It does so by "projecting out" the factors listed after the `|` in the formula instead of coding them as dummy variables. As the `lfe` `reprex` below shows, the effect is to omit the coefficients for the factor (categorical) variables from the reported model, leaving only those for the non-factor covariates. The fit itself is unchanged: the estimates and standard errors for `x1` and `x2`, the residuals, and the 966 residual degrees of freedom all match the `lm()` fit. What differs is that the projected model reports only as many coefficients as there are non-factor variables, so in the `felm()` summary its F-statistic has 2 numerator degrees of freedom against 33 for the full model.
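For the single-factor case, the within-groups transformation can be sketched in base R as follows. This is only an illustration on made-up data (the names `g`, `g.eff`, `y.w`, and `x.w` are hypothetical): subtracting group means from `y` and `x` and regressing the demeaned variables gives the same slope as the dummy-variable regression.

```r
# Illustrative data: one covariate, one factor with 25 levels
set.seed(1)
n <- 500
d <- data.frame(x = rnorm(n), g = factor(sample(25, n, replace = TRUE)))
g.eff <- rnorm(nlevels(d$g))
d$y <- 2 * d$x + g.eff[d$g] + rnorm(n)

# Within-groups transformation: subtract each group's mean
d$y.w <- d$y - ave(d$y, d$g)
d$x.w <- d$x - ave(d$x, d$g)

# Slope from the demeaned regression vs. the dummy-variable regression
b.within <- coef(lm(y.w ~ x.w - 1, data = d))[["x.w"]]
b.dummy  <- coef(lm(y ~ x + g, data = d))[["x"]]
all.equal(b.within, b.dummy)
```

Note that the naive `lm()` on demeaned data does not know that the group means were estimated, so its residual degrees of freedom and standard errors would need adjusting; only the point estimate is compared here.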

``````
library(lfe)
## Simulate data
set.seed(42)
n <- 1e3

d <- data.frame(
  # Covariates
  x1 = rnorm(n),
  x2 = rnorm(n),
  # Individuals and firms
  id = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE)),
  # Noise
  u = rnorm(n)
)

# Effects for individuals and firms
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))

# Left hand side
d$y <- d$x1 + 0.5 * d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u

## Estimate the model and print the results
est <- felm(y ~ x1 + x2 | id + firm, data = d)
summary(est)
#>
#> Call:
#>    felm(formula = y ~ x1 + x2 | id + firm, data = d)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.3751 -0.6768  0.0088  0.6883  2.7803
#>
#> Coefficients:
#>    Estimate Std. Error t value Pr(>|t|)
#> x1  1.04326    0.03228   32.32   <2e-16 ***
#> x2  0.49041    0.03254   15.07   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared(full model): 0.7539   Adjusted R-squared: 0.7455
#> Multiple R-squared(proj model): 0.5696   Adjusted R-squared: 0.5549
#> F-statistic(full model):89.69 on 33 and 966 DF, p-value: < 2.2e-16
#> F-statistic(proj model): 639.2 on 2 and 966 DF, p-value: < 2.2e-16
# Compare with lm
summary(lm(y ~ x1 + x2 + id + firm - 1, data = d))
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + id + firm - 1, data = d)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.3751 -0.6768  0.0088  0.6883  2.7803
#>
#> Coefficients:
#>        Estimate Std. Error t value Pr(>|t|)
#> x1      1.04326    0.03228  32.319  < 2e-16 ***
#> x2      0.49041    0.03254  15.072  < 2e-16 ***
#> id1     3.74166    0.17650  21.199  < 2e-16 ***
#> id2     0.96200    0.17927   5.366 1.01e-07 ***
#> id3     1.02686    0.20249   5.071 4.74e-07 ***
#> id4     2.13960    0.17190  12.447  < 2e-16 ***
#> id5     1.12131    0.17503   6.406 2.32e-10 ***
#> id6     0.85863    0.18845   4.556 5.87e-06 ***
#> id7     0.85256    0.17839   4.779 2.03e-06 ***
#> id8     1.25744    0.18396   6.835 1.45e-11 ***
#> id9    -0.95332    0.19765  -4.823 1.64e-06 ***
#> id10    0.50332    0.18943   2.657 0.008014 **
#> id11    1.29660    0.18697   6.935 7.44e-12 ***
#> id12    2.00367    0.17489  11.457  < 2e-16 ***
#> id13   -0.02849    0.20090  -0.142 0.887257
#> id14    0.66788    0.18563   3.598 0.000337 ***
#> id15   -0.07461    0.17510  -0.426 0.670153
#> id16    1.51743    0.17799   8.525  < 2e-16 ***
#> id17    2.10649    0.18372  11.466  < 2e-16 ***
#> id18    1.18966    0.17464   6.812 1.69e-11 ***
#> id19    1.34483    0.18893   7.118 2.13e-12 ***
#> id20   -1.20084    0.18328  -6.552 9.21e-11 ***
#> firm2  -1.50725    0.17093  -8.818  < 2e-16 ***
#> firm3  -1.87472    0.17236 -10.877  < 2e-16 ***
#> firm4  -1.24848    0.16611  -7.516 1.29e-13 ***
#> firm5  -0.74181    0.15959  -4.648 3.81e-06 ***
#> firm6   0.11010    0.16544   0.665 0.505893
#> firm7  -1.01232    0.16797  -6.027 2.37e-09 ***
#> firm8  -2.48896    0.16741 -14.868  < 2e-16 ***
#> firm9  -1.52025    0.16137  -9.421  < 2e-16 ***
#> firm10 -1.31793    0.15813  -8.334 2.66e-16 ***
#> firm11 -1.14281    0.15977  -7.153 1.68e-12 ***
#> firm12 -0.60866    0.17645  -3.449 0.000586 ***
#> firm13 -1.28568    0.16513  -7.786 1.78e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.005 on 966 degrees of freedom
#> Multiple R-squared:  0.7542, Adjusted R-squared:  0.7455
#> F-statistic: 87.17 on 34 and 966 DF,  p-value: < 2.2e-16
``````

Created on 2023-05-23 with reprex v2.0.2
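To give some intuition for what projecting out two factors means, here is a rough base-R imitation on similarly simulated data. It is a sketch, not `lfe`'s actual implementation: `demean_both()` is a hypothetical helper that alternately subtracts `id` means and `firm` means until the values stop changing, and the tolerance and iteration cap are arbitrary choices.

```r
# Simulated data in the spirit of the reprex (not the identical draws)
set.seed(42)
n <- 1000
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  id = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE))
)
d$y <- d$x1 + 0.5 * d$x2 + rnorm(20)[d$id] + rnorm(13)[d$firm] + rnorm(n)

# Hypothetical helper: alternate group-demeaning over two factors
demean_both <- function(v, f1, f2, tol = 1e-10, max.iter = 1000) {
  for (i in seq_len(max.iter)) {
    old <- v
    v <- v - ave(v, f1)  # sweep out f1 means
    v <- v - ave(v, f2)  # sweep out f2 means
    if (max(abs(v - old)) < tol) break
  }
  v
}

# Project both factors out of the response and the covariates
p <- data.frame(
  y  = demean_both(d$y,  d$id, d$firm),
  x1 = demean_both(d$x1, d$id, d$firm),
  x2 = demean_both(d$x2, d$id, d$firm)
)

# Coefficients from the projected data vs. the full dummy-variable model
b.proj <- coef(lm(y ~ x1 + x2 - 1, data = p))
b.full <- coef(lm(y ~ x1 + x2 + id + firm, data = d))[c("x1", "x2")]
all.equal(b.proj, b.full, tolerance = 1e-6)
```

With a single factor one pass of demeaning is enough, which is why that case does not need `felm()`; with two or more factors the sweeps have to be repeated until they settle.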
