Why are lm() models not univocal?

Hi there,

Why do lm(y ~ x) and lm(x ~ y) return seemingly unrelated coefficients? See the reprex below.

suppressWarnings(library(tidyverse))
set.seed(651)
# Random line with noise
a <- rnorm(1, 2)
b <- rnorm(1, 100, sd = 50)
df <- tibble(x = 1:1000,
             y = a * x + b + rnorm(1000, sd = 500))

# Linear model
(lin_mod <- lm(y ~ x, data = df))
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Coefficients:
#> (Intercept)            x  
#>     130.636        1.299

# With simple math we can rearrange the formula for x
# y = a*x + b
# x = 1/a * y - (b/a)
(expected_slope <- 1/lin_mod$coefficients[[2]])
#> [1] 0.7698294
(expected_intercept <- -(lin_mod$coefficients[[1]]/lin_mod$coefficients[[2]]))
#> [1] -100.5674

# Linear model with swapped coordinates
(lin_mod_swap <- lm(x ~ y, data = df))
#> 
#> Call:
#> lm(formula = x ~ y, data = df)
#> 
#> Coefficients:
#> (Intercept)            y  
#>     281.092        0.281

# Apparently these coefficients do not match the expected ones
lin_mod_swap$coefficients[[1]] == expected_intercept
#> [1] FALSE
lin_mod_swap$coefficients[[2]] == expected_slope
#> [1] FALSE

ggplot(df,
       aes(x = y,
           y = x)) +
  geom_point(size = 0.5) +
  geom_abline(slope = lin_mod_swap$coefficients[[2]],
              intercept = lin_mod_swap$coefficients[[1]],
              color = 'red') +
  geom_abline(slope = expected_slope,
              intercept = expected_intercept,
              color = 'blue') +
  theme_classic()

Created on 2023-11-13 with reprex v2.0.2

You can use dplyr::near() to check approximate equality of floating-point numbers. This is an issue of numeric precision in floating-point representation.

library(tidyverse)
tru_mult <- 1.299
tru_intercept <- 130.636

# Noise-free data: every point lies exactly on the line
(frm_1 <- tibble(x = 1:100) |>
  mutate(y = tru_mult * x + tru_intercept))

lm_1 <- lm(y ~ x, data = frm_1)

coefficients(lm_1)

# Compare the fitted coefficients with the true ones, up to a tolerance
dplyr::near(coefficients(lm_1)[[1]], tru_intercept)
dplyr::near(coefficients(lm_1)[[2]], tru_mult)

lm_swap <- lm(x ~ y, data = frm_1)
coefficients(lm_swap)

# Compare the swapped fit with the algebraically inverted line
dplyr::near(coefficients(lm_swap)[[1]], -tru_intercept / tru_mult)
dplyr::near(coefficients(lm_swap)[[2]], 1 / tru_mult)

That's not how a regression works. One estimate minimizes the squared errors in the vertical direction and the other minimizes them in the horizontal direction. There is no reason the two fitted lines should coincide.
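
A minimal sketch of why the two fits differ, using base R only and simulated data in the spirit of the reprex above. The seed, slope, and intercept here are illustrative assumptions, not the original values. The least-squares slope of y on x is cov(x, y) / var(x), and the slope of x on y is cov(x, y) / var(y), so their product is the squared correlation. The swapped slope therefore equals 1/slope only when the points lie exactly on a line.

```r
set.seed(1)  # arbitrary seed, chosen for illustration only
x <- 1:1000
y <- 1.3 * x + 130 + rnorm(1000, sd = 500)  # illustrative slope and intercept

slope_yx <- unname(coef(lm(y ~ x))[2])  # minimizes vertical (y) residuals
slope_xy <- unname(coef(lm(x ~ y))[2])  # minimizes horizontal (x) residuals

# Closed-form least-squares slopes
all.equal(slope_yx, cov(x, y) / var(x))
all.equal(slope_xy, cov(x, y) / var(y))

# The product of the two slopes is r^2, not 1
r2 <- cor(x, y)^2
all.equal(slope_yx * slope_xy, r2)

# So the swapped slope is the algebraic inverse shrunk by a factor of r^2
all.equal(slope_xy, r2 / slope_yx)
```

By the same identity, the coefficients printed in the reprex give 1.299 * 0.281, or roughly 0.365, which should be close to the squared correlation of that data set.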

If the linear model is an 'optimal' representation of data, then swapping coordinates shouldn't have such an effect. But thank you, anyway.

You have to define what you mean by optimal. A regression minimizes the sum of squared errors in the dimension of the dependent variable, and it is optimal in that sense. If you want some other optimality property, then you need to do something other than least squares. (Perhaps you want to look at an "orthogonal regression.")
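
As a rough illustration of the orthogonal regression idea, one common approach is to take the first principal axis of the centered data, which minimizes squared perpendicular distances rather than vertical ones. The sketch below uses base R's prcomp(); the helper name orthogonal_fit and the simulated data are assumptions made for this example, not something from the thread. Note that this kind of fit depends on the units of x and y, so it is not a drop-in replacement for lm().

```r
set.seed(1)  # arbitrary seed for illustration
x <- 1:1000
y <- 1.3 * x + 130 + rnorm(1000, sd = 500)  # illustrative slope and intercept

# Hypothetical helper: slope and intercept of the first principal axis,
# describing v as a function of u
orthogonal_fit <- function(u, v) {
  pc <- prcomp(cbind(u, v))  # centers the columns by default, no scaling
  slope <- unname(pc$rotation[2, 1] / pc$rotation[1, 1])
  c(intercept = mean(v) - slope * mean(u), slope = slope)
}

fit_yx <- orthogonal_fit(x, y)  # y as a function of x
fit_xy <- orthogonal_fit(y, x)  # x as a function of y

# Swapping the variables now gives the algebraically inverted line
all.equal(fit_xy[["slope"]], 1 / fit_yx[["slope"]])
all.equal(fit_xy[["intercept"]], -fit_yx[["intercept"]] / fit_yx[["slope"]])
```

Because the principal axis is a property of the point cloud rather than of a chosen dependent variable, both calls describe the same line, which is the symmetry the original question was expecting.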

It is not an issue of numeric precision.
If you add an error term to y to make the data more realistic, you get completely different results:

(frm_1 <- tibble(x = 1:100) |>
  mutate(y = tru_mult * x + tru_intercept + rnorm(100, sd = 1)))
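
To see how the mismatch grows with the error term, here is a small base R sketch that refits both models at several noise levels. The helper name slope_ratio, the seed, and the chosen noise levels are assumptions made for illustration. It reports the ratio between the swapped-fit slope and the algebraic inverse 1/slope; a ratio of 1 means the two lines agree.

```r
set.seed(1)  # arbitrary seed for illustration
tru_mult <- 1.299
tru_intercept <- 130.636
x <- 1:100

# Hypothetical helper: ratio of the swapped-fit slope to 1 / slope
slope_ratio <- function(noise_sd) {
  y <- tru_mult * x + tru_intercept + rnorm(length(x), sd = noise_sd)
  slope_yx <- coef(lm(y ~ x))[[2]]
  slope_xy <- coef(lm(x ~ y))[[2]]
  slope_xy / (1 / slope_yx)
}

noise_sds <- c(0, 1, 10, 100)  # illustrative noise levels
ratios <- sapply(noise_sds, slope_ratio)
data.frame(noise_sd = noise_sds, ratio = ratios)
```

With no noise the ratio is 1 up to floating-point error, which is the case the dplyr::near() example covers. With any noise the ratio drops below 1 by far more than a floating-point tolerance, so the difference is structural rather than a precision problem.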

This question is often asked when you want to compare two methods to measure the same thing. See here for a discussion
