Why are lm() models not univocal?

Dobrokhotov1989 · November 13, 2023, 3:31pm

Hi there,

Why do lm(y ~ x) and lm(x ~ y) return seemingly unrelated coefficients? See the reprex below.

suppressWarnings(library(tidyverse))
set.seed(651)
# Random line with noise
a <- rnorm(1, 2)
b <- rnorm(1, 100, sd = 50)
df <- tibble(x = 1:1000,
             y = a * x + b + rnorm(1000, sd = 500))

# Linear model
(lin_mod <- lm(y ~ x, data = df))
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Coefficients:
#> (Intercept)            x  
#>     130.636        1.299

# With simple math we can rearrange the formula for x
# y = a*x + b
# x = 1/a * y - (b/a)
(expected_slope <- 1/lin_mod$coefficients[[2]])
#> [1] 0.7698294
(expected_intercept <- -(lin_mod$coefficients[[1]]/lin_mod$coefficients[[2]]))
#> [1] -100.5674

# Linear model with swapped coordinates
(lin_mod_swap <- lm(x ~ y, data = df))
#> 
#> Call:
#> lm(formula = x ~ y, data = df)
#> 
#> Coefficients:
#> (Intercept)            y  
#>     281.092        0.281

# Apparently these coefficients do not match with the expected ones
lin_mod_swap$coefficients[[1]] == expected_intercept
#> [1] FALSE
lin_mod_swap$coefficients[[2]] == expected_slope
#> [1] FALSE

ggplot(df,
       aes(x = y,
           y = x)) +
  geom_point(size = 0.5) +
  geom_abline(slope = lin_mod_swap$coefficients[[2]],
              intercept = lin_mod_swap$coefficients[[1]],
              color = 'red') +
  geom_abline(slope = expected_slope,
              intercept = expected_intercept,
              color = 'blue') +
  theme_classic()

Created on 2023-11-13 with reprex v2.0.2

nirgrahamuk · November 13, 2023, 4:10pm

You can use dplyr::near() to check near-enough equality of floating-point numbers. This is an issue of numeric precision in floating-point representation.

library(tidyverse)
tru_mult <- 1.299
tru_intercept <- 130.636        

(frm_1<-tibble(x=1:100) |> mutate(
  y= tru_mult*x+tru_intercept
))

lm_1 <- lm(y~x,data=frm_1)

coefficients(lm_1)

dplyr::near(coefficients(lm_1)[[1]],tru_intercept)
dplyr::near(coefficients(lm_1)[[2]],tru_mult)

lm_swap <- lm(x~y,data=frm_1)
coefficients(lm_swap)

dplyr::near(coefficients(lm_swap)[[1]],-tru_intercept/tru_mult)
dplyr::near(coefficients(lm_swap)[[2]],1/tru_mult)

startz · November 13, 2023, 5:04pm

That's not how a regression works. One estimate is minimizing errors in the vertical direction and the other is minimizing errors in the horizontal direction. No reason the slopes should be the same.
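To make the vertical-versus-horizontal point concrete, here is a minimal base R sketch on simulated data (the seed and the data-generating values are assumptions chosen only to resemble the reprex). It uses the closed-form slopes of simple least squares, cov(x, y) / var(x) for lm(y ~ x) and cov(x, y) / var(y) for lm(x ~ y), and checks that their product is the squared correlation rather than 1.

```r
# Simulated data for illustration only (not the original poster's data)
set.seed(1)
x <- 1:1000
y <- 1.3 * x + 130 + rnorm(1000, sd = 500)

# Closed-form least-squares slopes
slope_yx <- cov(x, y) / var(x)  # lm(y ~ x): squared errors measured along y
slope_xy <- cov(x, y) / var(y)  # lm(x ~ y): squared errors measured along x

# Both agree with what lm() reports
all.equal(slope_yx, unname(coef(lm(y ~ x))[2]))
all.equal(slope_xy, unname(coef(lm(x ~ y))[2]))

# The product of the two slopes is the squared correlation, not 1
all.equal(slope_yx * slope_xy, cor(x, y)^2)
```

So the swapped slope equals 1 / slope_yx only when the correlation is exactly 1 or -1. With the rounded coefficients printed in the original post, 1.299 * 0.281 is about 0.365, which should be close to the squared correlation of that data.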

Dobrokhotov1989 · November 13, 2023, 5:46pm

If the linear model is an 'optimal' representation of data, then swapping coordinates shouldn't have such an effect. But thank you, anyway.

startz · November 13, 2023, 5:50pm

You have to define what you mean by optimal. A regression minimizes the sum of squared errors in the dimension of the dependent variable. A regression is optimal in that sense. If you want some other optimality property, then you need to do something other than least squares. (Perhaps you want to look at an "orthogonal regression.")
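As a rough sketch of what an orthogonal regression could look like, the line that minimizes perpendicular distances can be taken from the first principal axis of the centered data. The helper name orth_fit is hypothetical, the data are simulated, and using prcomp() here is one possible approach, not something proposed in this thread.

```r
# Simulated data for illustration only
set.seed(1)
x <- 1:1000
y <- 1.3 * x + 130 + rnorm(1000, sd = 500)

# Hypothetical helper: line through the centroid along the first principal axis
orth_fit <- function(u, v) {
  pc  <- prcomp(cbind(u, v))      # centers the columns by default, no scaling
  dir <- pc$rotation[, 1]         # direction of largest variance
  slope <- dir[[2]] / dir[[1]]    # rise in v per unit of u (sign of dir cancels)
  c(intercept = mean(v) - slope * mean(u), slope = slope)
}

fit_yx <- orth_fit(x, y)  # y as a function of x
fit_xy <- orth_fit(y, x)  # x as a function of y

# Unlike lm(), swapping the axes gives the algebraically inverted line
all.equal(fit_xy[["slope"]], 1 / fit_yx[["slope"]])
all.equal(fit_xy[["intercept"]], -fit_yx[["intercept"]] / fit_yx[["slope"]])
```

Note that this fit depends on the units of x and y, so rescaling one variable changes the line. Whether that is acceptable depends on the problem.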

Dobrokhotov1989 · November 13, 2023, 5:53pm

It is not an issue of numeric precision.
If you add an 'error' term to y to make it more "real world", you get completely different results:

(frm_1<-tibble(x=1:100) |> mutate(
  y= tru_mult*x+tru_intercept + rnorm(100, sd = 1)
))
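How far the two answers drift apart depends on how much noise there is. The sketch below reuses tru_mult and tru_intercept from the earlier reply. The seed and the noise levels are assumptions picked for illustration. For each noise level it compares the slope of lm(x ~ y) with the reciprocal of the slope of lm(y ~ x).

```r
set.seed(42)
tru_mult <- 1.299
tru_intercept <- 130.636
x <- 1:100

# Noise levels to try (illustrative values)
noise_sd <- c(0, 1, 10, 50)

res <- sapply(noise_sd, function(s) {
  y <- tru_mult * x + tru_intercept + rnorm(length(x), sd = s)
  b_yx <- coef(lm(y ~ x))[[2]]  # slope of y on x
  b_xy <- coef(lm(x ~ y))[[2]]  # slope of x on y
  c(noise_sd = s,
    swapped_slope = b_xy,
    inverted_slope = 1 / b_yx,
    r_squared = cor(x, y)^2)
})

# One row per noise level
t(res)
```

With no noise the two columns agree up to floating-point error. As the noise grows, r_squared falls and swapped_slope shrinks below inverted_slope.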

dmenne · November 20, 2023, 3:54pm

This question is often asked when you want to compare two methods to measure the same thing. See here for a discussion
