Hello. I build a small simulation to help me understand simple linear regression estimation of coefficient p-values. The eventual goal was to examine step-wise regression and determine how often it was wrong (selecting redundant variables) under various conditions of noise and variable correlation. But now I've gotten myself confused on the very basics...
In this code below, I setup a simulation using 20 predictor variables as follows:
- Set the beta coefficients for 20 variables such that the first has a coefficient of 1 and all the rest are 0.
- Draw the x's for each of the 20 variables from a standard normal distribution
- Modify X2 (one of the vars with 0 coeficient) to be equal to X1 + some noise (X2 = X1 + rnorm(1, 0, 1)
- Create the observations / outcome variable by multiplying each Beta coeficient by its X value, summing them, then adding a residual error to the total (rnorm(1, 0, .3)).
The goal of all of this was to see how often the model identified variable 2 as being significant based on the different noise settings. The problem is that when I run lm() on the final dataset, I can't ever get it to pick up variable 2 as significant. While this is true (its true beta is 0), I thought that it would be detected as significant since it was forced to mirror the true predictor X1.
Why can't the model detect X2 as significant here?
Here is the reprex (sorry for the verbose code...I am still learning at efficient R):
library(tidyverse)
library(broom)
#Set things up
betas_tbl <- tibble(
var_prefix = rep("b", 20),
var_suffix = seq(from = 1, to = 20, by = 1)
) %>%
mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
dplyr::select(var) %>%
mutate(case_1 = c(1, rep(0, 19))) %>%
pivot_wider(names_from = "var", values_from = "case_1")
x_tbl <- tibble(
var_prefix = rep("x", 20),
var_suffix = seq(from = 1, to = 20, by = 1)
) %>%
mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
dplyr::select(var) %>%
mutate(case_1 = c(1, rep(0, 19))) %>%
pivot_wider(names_from = "var", values_from = "case_1")
#Simulated Experiment
full_wide_tbl <- betas_tbl %>%
bind_cols(x_tbl) %>%
slice(rep(1:n(), each = 500)) %>%
rowwise() %>%
mutate(across(x_1:x_20, ~ rnorm(1, mean = 0, sd = 1))) %>%
mutate(x_2 = x_1 + rnorm(n=1, mean = 0, sd = 1)) %>% #force x_1 to be a cause of x_2; they will be correlated
mutate(
bx_1 = b_1*x_1,
bx_2 = b_2*x_2,
bx_3 = b_3*x_3,
bx_4 = b_4*x_4,
bx_5 = b_5*x_5,
bx_6 = b_6*x_6,
bx_7 = b_7*x_7,
bx_8 = b_8*x_8,
bx_9 = b_9*x_9,
bx_10 = b_10*x_10,
bx_11 = b_11*x_11,
bx_12 = b_12*x_12,
bx_13 = b_13*x_13,
bx_14 = b_14*x_14,
bx_15 = b_15*x_15,
bx_16 = b_16*x_16,
bx_17 = b_17*x_17,
bx_18 = b_18*x_18,
bx_19 = b_19*x_19,
bx_20 = b_20*x_20,
resid = rnorm(1, mean = 0, sd = .3)) %>%
ungroup() %>%
mutate(observation = rowSums(select(., matches("bx|resid"))))
#Visualizing X variables vs. Outcome
full_wide_tbl %>%
pivot_longer(cols = x_1:x_20, names_to = "var", values_to = "val") %>%
mutate(var = as_factor(var)) %>%
ggplot(aes(x=val, y = observation)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~var)
# Model
lm <- lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 + x_11 + x_12 + x_13 + x_14 + x_15 + x_16 + x_17 + x_18 + x_19 + x_20, data = full_wide_tbl)
#Model Doesn't think X2 is significant
tidy(summary(lm))
confint(lm)
Based on the plots of observation vs X, I think the code is doing what I want. There is clearly a positive slope in the X2 vs Observation plot. Based on this I figured lm() would estimate the beta2 at around 1 but it never thinks it's very different from zero.
Can anyone help me understand why X2 is never detected as significant here?
Thank you!