Question About lm() Estimating Variable Coefficients in a linear model when 2 Vars are Correlated

tw0handt0uch · August 1, 2021, 7:24pm

Hello. I build a small simulation to help me understand simple linear regression estimation of coefficient p-values. The eventual goal was to examine step-wise regression and determine how often it was wrong (selecting redundant variables) under various conditions of noise and variable correlation. But now I've gotten myself confused on the very basics...

In this code below, I setup a simulation using 20 predictor variables as follows:

Set the beta coefficients for 20 variables such that the first has a coefficient of 1 and all the rest are 0.
Draw the x's for each of the 20 variables from a standard normal distribution
Modify X2 (one of the vars with 0 coeficient) to be equal to X1 + some noise (X2 = X1 + rnorm(1, 0, 1)
Create the observations / outcome variable by multiplying each Beta coeficient by its X value, summing them, then adding a residual error to the total (rnorm(1, 0, .3)).

The goal of all of this was to see how often the model identified variable 2 as being significant based on the different noise settings. The problem is that when I run lm() on the final dataset, I can't ever get it to pick up variable 2 as significant. While this is true (its true beta is 0), I thought that it would be detected as significant since it was forced to mirror the true predictor X1.

Why can't the model detect X2 as significant here?

Here is the reprex (sorry for the verbose code...I am still learning at efficient R):

library(tidyverse)
library(broom)

#Set things up
betas_tbl <- tibble(
  var_prefix = rep("b", 20),
  var_suffix = seq(from = 1, to = 20, by = 1)
) %>%
  mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
  dplyr::select(var) %>%
  mutate(case_1 = c(1, rep(0, 19))) %>%
  pivot_wider(names_from = "var", values_from = "case_1")

x_tbl <- tibble(
  var_prefix = rep("x", 20),
  var_suffix = seq(from = 1, to = 20, by = 1)
) %>%
  mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
  dplyr::select(var) %>%
  mutate(case_1 = c(1, rep(0, 19))) %>%
  pivot_wider(names_from = "var", values_from = "case_1")

#Simulated Experiment
full_wide_tbl <- betas_tbl %>%
  bind_cols(x_tbl) %>%
  slice(rep(1:n(), each = 500)) %>%
  rowwise() %>%
    mutate(across(x_1:x_20, ~ rnorm(1, mean = 0, sd = 1))) %>%
    mutate(x_2 = x_1 + rnorm(n=1, mean = 0, sd = 1)) %>% #force x_1 to be a cause of x_2; they will be correlated
    mutate(
         bx_1 = b_1*x_1,
         bx_2 = b_2*x_2,
         bx_3 = b_3*x_3,
         bx_4 = b_4*x_4,
         bx_5 = b_5*x_5,
         bx_6 = b_6*x_6,
         bx_7 = b_7*x_7,
         bx_8 = b_8*x_8,
         bx_9 = b_9*x_9,
         bx_10 = b_10*x_10,
         bx_11 = b_11*x_11,
         bx_12 = b_12*x_12,
         bx_13 = b_13*x_13,
         bx_14 = b_14*x_14,
         bx_15 = b_15*x_15,
         bx_16 = b_16*x_16,
         bx_17 = b_17*x_17,
         bx_18 = b_18*x_18,
         bx_19 = b_19*x_19,
         bx_20 = b_20*x_20,
         resid = rnorm(1, mean = 0, sd = .3)) %>%
  ungroup() %>%
    mutate(observation = rowSums(select(., matches("bx|resid"))))

#Visualizing X variables vs. Outcome
full_wide_tbl %>%
  pivot_longer(cols = x_1:x_20, names_to = "var", values_to = "val") %>%
  mutate(var = as_factor(var)) %>%
  ggplot(aes(x=val, y = observation)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~var)

# Model
lm <- lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 + x_11 + x_12 + x_13 + x_14 + x_15 + x_16 + x_17 + x_18 + x_19 + x_20, data = full_wide_tbl)

#Model Doesn't think X2 is significant
tidy(summary(lm))

confint(lm)

Based on the plots of observation vs X, I think the code is doing what I want. There is clearly a positive slope in the X2 vs Observation plot. Based on this I figured lm() would estimate the beta2 at around 1 but it never thinks it's very different from zero.

Can anyone help me understand why X2 is never detected as significant here?

Thank you!

startz · August 1, 2021, 8:09pm

The whole point of a multiple regression is to separate the effects of correlated independent variables. So you are getting the right answer! (Well, the coefficient should be randomly significant at the 5% level 1 time out of 20 )

tw0handt0uch · August 1, 2021, 8:52pm

Thank you for confirming. I actually just found this answer on cross validated which basically shows the theory behind what I am seeing using a less convoluted example. I think I get it now.

system · August 8, 2021, 8:53pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.