How does lm differentiate polynomial vs multiple regression?

I'm new to R and following along with a course, so forgive me if this question is butchered with core misunderstandings.

I created a multiple linear regression using the following command:

regressor = lm(formula = Profit ~ ., data = training_set)

When plotted, it generated a straight, "best fitting line". I have no idea what mechanism lm uses to compute this, but from what I've read it seems fair to assume it might be solving for them analytically or using gradient descent to estimate the slope and intercept, based on partial derivatives for all the independent variables that were passed into it from training_set.

Then, working with a different data set, I modified it to include some new columns:

dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4

I then ran the same command from above again, and now suddenly my line is curved, like a polynomial regression, yet I called the identical lm function.

I'm not sure how this works. My assumption is that multiple linear regression (which outputs a straight line) and polynomial linear regression (which can output a curve) might have different cost functions, or use different strategies to solve for the coefficients... If this assumption is true, how does lm decide which to use based only on my data? How does it know my power columns are not actually normal data columns that should be fit using a multiple linear regression?


You're getting different results because the regression models are different.

You haven't shown your data, but the model formula Profit ~ . will use all columns besides Profit as the independent variables. If Level is the only other column, then Profit ~ . is equivalent to Profit ~ Level, which will give a line (a first-order polynomial) for the regression fit.

However, after you create Level2, Level3 and Level4, the formula Profit ~ . is now equivalent to Profit ~ Level + Level2 + Level3 + Level4, which will give a fourth-order polynomial as the regression fit and will in general be a curve. (You can get the same regression fit with Profit ~ poly(Level, 4, raw=TRUE).)
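If you want to check that equivalence yourself, here's a minimal sketch with made-up data (your dataset isn't shown, so the numbers below are invented purely for illustration): adding the power columns and using the dot formula gives exactly the same coefficients as poly(..., raw = TRUE).

# Made-up data standing in for your dataset
set.seed(1)
dataset = data.frame(Level = 1:10,
                     Profit = (1:10)^3 + rnorm(10, sd = 50))

# Add the power columns, as in your code
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4

fit_dot  = lm(Profit ~ ., data = dataset)
fit_poly = lm(Profit ~ poly(Level, 4, raw = TRUE), data = dataset)

# Identical coefficients, just different term labels
cbind(dot = coef(fit_dot), poly = coef(fit_poly))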

As far as the details of how lm is finding the regression coefficients, this blog post has details on the actual algorithms R uses for finding the least squares solution to the linear regression.
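If you'd like to peek at that machinery yourself, here's a minimal sketch (using the built-in mtcars data as a stand-in, since your training_set isn't shown). lm solves the least-squares problem directly via a QR decomposition of the model matrix, rather than by iterative methods like gradient descent, and you can reproduce its coefficients with qr():

# Fit a two-predictor model with lm
fit = lm(mpg ~ hp + wt, data = mtcars)

# Build the same model matrix lm uses (intercept, hp, wt) and solve by QR
X = model.matrix(mpg ~ hp + wt, data = mtcars)
y = mtcars$mpg

coef(fit)
qr.coef(qr(X), y)  # same values as coef(fit)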


Ok thanks. What you said makes sense and fits with my understanding, so that probably means I didn't articulate my question well.

In your first example you said: If Level is the only other column...

In my case, it's not. The sample data set I used had 3-4 other IV columns, and lm outputs a straight line. The tutorial called this a multiple linear regression.

Yet in a different example I used the same lm function which had other columns being passed in, and then suddenly the line is curved.

So it seems sometimes lm outputs a straight line and other times a curve. Maybe it's because I'm a programmer, but I'm not used to functions that give fundamentally different kinds of output unless I explicitly tell them to, e.g. telling the lm function "this should be a polynomial regression" or "this should be a multiple linear regression with a straight line".

Edit: PS I will read your article right now.

What lm outputs depends on the specific formula and data you feed to it. If you change either one, you'll in general get a different result. In the y ~ . notation, the . means "all other columns in the data besides y", so if you change the data by adding columns, then you are indeed changing the regression formula when you use the . notation.

The regression fit will be a line (or a plane or a hyperplane if you have more than one continuous independent variable) but you'll only get a curved line (or curved plane or curved hyperplane) if you have non-linear terms in the regression formula.
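Here's a tiny sketch of that distinction, using made-up data since yours isn't shown: the same lm call gives a straight fit or a curved fit depending purely on the terms in the formula.

# Invented data for illustration only
set.seed(42)
d = data.frame(x = seq(1, 10, by = 0.5))
d$y = 3 + 2 * d$x - 0.15 * d$x^2 + rnorm(nrow(d))

fit_line  = lm(y ~ x, data = d)           # only a linear term: straight line
fit_curve = lm(y ~ x + I(x^2), data = d)  # second-order term: curve

plot(y ~ x, data = d)
lines(d$x, predict(fit_line), lty = 2)
lines(d$x, predict(fit_curve))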

The . is a convenience. You can always make the formula explicit by writing out the specific columns you want included in the regression. Without seeing your full code and a sample of your data, it's hard to provide any additional advice or commentary beyond that.

Hmm.. I have done so much reading after days of this that everything is blending together. I went back and graphed one variable of the multiple linear regression and it's not straight at all... I have no idea where I got the idea that the predictor line was straight.

I guess I will just continue experimenting until this makes sense.

What you're seeing is due to the fact that each data point not only has a different value for Years of Experience, each also (in general) has different values for the other independent variables in the regression. That's what's causing the squiggly line, even though the regression "hyperplane" is flat (assuming you have only linear terms in the regression).

Here's an example with the built-in mtcars data frame. The regression below has two independent variables (hp and wt), so the fitted regression equation is a (flat) plane. When I plot the fitted values and connect them (the gray squiggly line), it's all over the place because, in general, points with similar values of hp can have different values of wt. If you look at the regression coefficients (run coef(m1)), you'll see that the coefficient for wt is -3.88, meaning a car that is 1.0 greater in wt (for a given value of hp) is predicted to be 3.88 lower in mpg.

To see the predicted relationship of hp and mpg separately from wt, we need to plot mpg vs. hp for a fixed value of wt. The plot below shows mpg vs. hp for three specific values of wt (the 10th, 50th, and 90th percentiles in the data). You can see that each of those lines is straight.

library(tidyverse)

# Model
m1 = lm(mpg ~ hp + wt, data=mtcars)

# Model predictions for the data
mtcars$mpg.fitted = predict(m1)

# Model predictions for specific values of each independent variable
newdata = data.frame(hp=rep(seq(min(mtcars$hp), max(mtcars$hp), length=20), 3),
                     wt=rep(quantile(mtcars$wt, prob=c(0.1,0.5,0.9)), each=20))

newdata$mpg.fitted = predict(m1, newdata)

ggplot(mtcars, aes(x=hp)) + 
  geom_point(aes(y=mpg)) +
  geom_line(aes(y=mpg.fitted), colour="grey60") + 
  geom_point(aes(y=mpg.fitted), colour="grey60", size=1) + 
  geom_line(data=newdata, aes(y=mpg.fitted, colour=factor(wt))) +
  labs(colour="wt") +
  theme_bw()

[Figure: mpg vs. hp, showing the data, the fitted values connected by the gray line, and the three straight fitted lines at fixed values of wt]

Since this regression model has only two independent variables and the regression function is therefore a plane, we can visualize it with a 3D plot. In the plot below, the plane is the fitted regression equation. The little circles are the data points. The three lines we drew in the 2D plot above are the lines of constant wt on the plane in the plot below:

library(rockchalk)

lf = plotCurves(m1, plotx="hp", modx="wt", modxVals=unique(newdata$wt),
                col=hcl(seq(15,375,length=4)[1:3],100,65), lty=1)
plotPlane(m1, plotx1="hp", plotx2="wt", drawArrows = TRUE, phi=0, theta=115,
          linesFrom=lf)

If we fit a model that is second-order in both independent variables, you can now see that the regression fit is a surface that is curved in both independent-variable directions.

# Model with second-order polynomials for each independent variable
m2 = lm(mpg ~ poly(hp,2,raw=TRUE) + poly(wt,2,raw=TRUE), data=mtcars)

lf = plotCurves(m2, plotx="hp", modx="wt", modxVals=unique(newdata$wt),
                col=hcl(seq(15,375,length=4)[1:3],100,65), lty=1)
plotPlane(m2, plotx1="hp", plotx2="wt", drawArrows = TRUE, phi=0, theta=115,
          linesFrom=lf)

[Figure: 3D regression surface for model m2, curved in both the hp and wt directions]


Wow, what an incredible reply... thank you!! Let me mull over this very slowly and graph along with your examples.

Ohhhh!... so holding the other variables constant is what makes the line for that particular variable straight.

Ok, I am still re-reading this several more times, very slowly; this is all very interesting.

What would the gray zig-zag line in the 2D chart represent in the first 3D plot above?

Or rather, can you explain what the gray line represents, but worded differently? I sense what you're saying but it is not clicking yet. Somehow I do understand why the lines are straight when the other variables are held constant, though, as it's almost like the combination of multiple simple regressions.

It almost seems like the blue line in my chart is representing the prediction of salary based on years of experience, but in those areas where it's zigging and zagging, some other variable is having "more influence" on the salary. Is that correct?

The regression equation is mpg = b_0 + b_1hp + b_2wt, where b_0, b_1 and b_2 are the regression coefficients. Putting in the regression coefficients from model m1 we have: mpg = 37.23 - 0.032*hp - 3.88*wt. This equation (which is the equation of the plane shown in the 3D plot) gives the predicted value of mpg for any combination of hp and wt.

Let's say I want to know the relationship between mpg and hp for hp = c(100, 200, 300). I can plug these values into the equation above to get mpg = 37.23 - 0.032*100 - 3.88*wt, and similarly for hp=200 and hp=300. But what do I choose for wt? As you can see, for any fixed value of hp, the predicted mpg will be different for different values of wt.
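Here's a quick numerical version of that point (this assumes the model m1 from my earlier post is still in your workspace): the same hp value gives a different predicted mpg for each choice of wt.

# Predictions from m1 at a few hp values, for three different wt values
grid = expand.grid(hp = c(100, 200, 300),
                   wt = quantile(mtcars$wt, probs = c(0.1, 0.5, 0.9)))
grid$mpg.pred = predict(m1, newdata = grid)
grid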

You get a zig-zag line when you plot the predicted values for the actual data, because wt can vary for any given value of hp. For example, in the mtcars data frame, there are 3 cars with hp=175 (type mtcars in the console and inspect the data), but each of these cars has a different value of wt. In the plot below, I've highlighted the predicted values for those 3 points in blue, while the other predicted values are in red (the gray points are the actual data). You can see that all of those 3 points have the same value of hp but they have different predicted values for mpg, because each of them has a different wt.

Once again, if we want to see the relationship between mpg and hp, we must hold wt fixed at a single value.

[Figure: mpg vs. hp, with the data in gray, the predicted values in red, and the predictions for the three cars with hp = 175 highlighted in blue]

The plot below is the 3D version of the plot above. The blue circles are the data points. The red points are the predicted values and are connected by the red line. The red points/red line lie on the regression plane (as they must, because the plane represents the predicted value of mpg for any combination of hp and wt), while the blue data points can lie above or below the regression plane.

The vertical distance from a data point to the regression plane is the residual for that point (the difference between the actual mpg value for that data point and the model's predicted mpg value for it). Every blue data point lies directly above, below or on its corresponding red prediction point.

The red line in the 2D graph above is what you get when you project the regression plane onto the hp-mpg plane (the front face of the cube in the 3D graph).

p=plotPlane(m1, plotx1="hp", plotx2="wt", drawArrows = TRUE, phi=0, theta=-25,
            alty=1.5, alength=0.05, acol="blue")
fit.line = with(mtcars[order(mtcars$hp),], trans3d(hp, wt, mpg.fitted, pmat=p$res))
lines(fit.line, lwd=1, col="red")
points(fit.line, pch=16, cex=0.8, col="red")


Wow, that is super awesome... I definitely get it now conceptually; your graphs are immensely helpful. Right now I'm just trying to visualize it. I think that will take some playing around with the data sets.

SKIP ME So one way I'm sort of picturing it is that... that red zig-zag line from hp "adds up" to a hyperplane when added to the zig-zag line of wt. Is this a wrong way to think about it? I don't necessarily mean on a 2D graph, although that would be interesting to see. I imagine it might be nonsensical... but in terms of the hyperplane, let's say the "dip" in mpg at around ~200 hp... could that mean that for whatever point that is, wt had some very strong influence because it "stole" away some of the predicted mpg, which caused the line to drop?

Edit: actually, after thinking about this a while, I don't think that's right. I don't know why my mind wants to make the terms add up to the plane. It's no different than the points in a simple linear regression not adding up to a line. I think it's because I get the sense that where the jagged line dips more, the value of the weight was larger, as if it's "sharing" the predicted value amongst the independent variables...

One funny side comment... I am lucky we live in 3 dimensions, as I at least get ONE dimension of "practice" to help me visualize higher-order functions, such as a neural net function with 100,000 inputs. :sweat_smile: I may not be able to visualize the 4th dimension and beyond, but at least seeing this in 3 dimensions gives a good sense of it.

@joels I don't want to take any more of your time with simple questions, but I do have just one which relates to my original question.

How does R know to do a flat hyperplane vs. a curved one, based only on the original data set and without telling it "please do a polynomial regression instead of a flat-plane multiple regression"? I see you wrote "if you have non-linear terms in the regression formula", but how is this being decided if I'm not telling lm which formula to use? Is it done because it tries many different models (multiple, polynomial, etc.) and just returns the best one?

The formula (e.g., mpg ~ hp + wt) tells the lm (or other modeling function) what mathematical function to use in modeling the data. There's no separate way that you would tell R whether to do a "polynomial regression" or "flat plane" regression (a flat plane is still a polynomial, but it's a first-order polynomial in both independent variables).

The shape of the regression surface (or hyperplane) you end up with is determined by the type of function used to model the data. The data are used to find the optimal coefficients for the given functional form (where, for the lm function, optimal means minimizing the sum of the squared residuals, but could mean something different for other types of models).

  • mpg ~ hp + wt fits the mathematical function: mpg = b_0 + b_1hp + b_2wt. This is inherently a flat plane.
  • mpg ~ poly(hp, 2) + wt fits the mathematical function: mpg = b_0 + b_1hp + b_2hp^2 + b_3wt. This is inherently a curved surface (flat in the wt direction and parabolic in the hp direction).
  • mpg ~ hp*wt fits the mathematical function: mpg = b_0 + b_1hp + b_2wt + b_3(hp × wt). This is inherently a curved surface, due to the interaction between hp and wt (the b_3(hp × wt) term).
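To make those three forms concrete, here's a small sketch fitting each of them to mtcars (I've used raw = TRUE in poly() so the coefficients map directly onto the equations above; the default orthogonal polynomials give the same fitted surface, just with different-looking coefficients):

# Three different functional forms, all fit by the same lm machinery
m_flat  = lm(mpg ~ hp + wt, data = mtcars)                       # flat plane
m_quad  = lm(mpg ~ poly(hp, 2, raw = TRUE) + wt, data = mtcars)  # parabolic in hp
m_inter = lm(mpg ~ hp * wt, data = mtcars)                       # curved via interaction

lapply(list(flat = m_flat, quadratic = m_quad, interaction = m_inter), coef)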

There are many more options for the mathematical function used to model the data (e.g., splines, generalized additive models, etc.).

All the lm (or other model) function "knows" is that it should find coefficient values that minimize the sum of the squared residuals (or find optimal coefficients based on some other "loss function" depending on the model type) for the given mathematical function used to model the data.
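If it helps, here's a sketch of that idea for the flat-plane model: find the coefficients by handing the residual sum of squares to a general-purpose optimizer and compare with lm. (This is purely for illustration; lm solves the least-squares problem directly rather than by iterative optimization, so the two answers agree only up to the optimizer's tolerance.)

# Residual sum of squares for mpg = b0 + b1*hp + b2*wt
rss = function(b) {
  sum((mtcars$mpg - (b[1] + b[2] * mtcars$hp + b[3] * mtcars$wt))^2)
}
fit_opt = optim(c(mean(mtcars$mpg), 0, 0), rss, method = "BFGS")

rbind(optim = fit_opt$par,
      lm    = unname(coef(lm(mpg ~ hp + wt, data = mtcars))))  # nearly identical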


Thank you very much :pray: I really appreciate your above-and-beyond help and amazing charts.

Ok I deleted my whole reply because I decided to just test it.

mtcars$hp2 = mtcars$hp^2
m2 = lm(mpg ~ hp2 + wt, mtcars)

lf = plotCurves(m2, plotx="hp2", modx="wt", modxVals=unique(newdata$wt),
                col=hcl(seq(15,375,length=4)[1:3],100,65), lty=1)
plotPlane(m2, plotx1="hp2", plotx2="wt", drawArrows = TRUE, phi=0, theta=115,
          linesFrom=lf)

This actually does not graph a curved plane, which in a way is really what I was trying to articulate to begin with. In the course I'm learning from, this is the method he used, and he basically just says "now it magically works, see... now it's a polynomial regression [instead of the multiple regression from the previous chapter]". He literally changed nothing about the command and only added the lines above it, deriving new columns from one of the original columns... and it was somehow a polynomial regression.

I have no idea how this is working in his case, but when I plot this it's still a flat plane. When I use poly() it becomes curved... so I don't know what poly() is doing differently, but it appears lm is smart enough to "see" the exponents... so it's NOT based on the data necessarily, but quite literally on the formula you are passing as input into lm.

This was originally causing me a lot of turmoil because it looked like simply by changing only the data, the formula to calculate the regression would completely change.

Here are his exact commands:

# Fitting Linear Regression to the dataset
lin_reg = lm(formula = Salary ~ .,
             data = dataset)

# Fitting Polynomial Regression to the dataset
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ .,
              data = dataset)

As you can see, the commands are identical except that he mutates the dataset before passing it in. So I have no idea how, in his lessons, lm knows to change its formula when the original dataset is changed, unless those 3 lines are somehow storing metadata inside dataset that I'm not aware of, and lm is able to read that metadata, see that a column is an exponent column, and use a different formula?

I'll expand on my first response, where I noted that the model formula y ~ . will use all columns in the data besides y as independent variables. Thus, if you add a column to the data, such as the square of an existing column, and run the model again, you'll now have a new model with an extra term for the squared variable, and your model will now be a second-order polynomial. Here's a demonstration using variations on the same mtcars model I created above:

library(tidyverse)
library(broom)

# Start with a clean data frame with just three columns
data(mtcars)
df = mtcars[ , c("mpg", "hp", "wt")]
# Create two models, one with the "dot" formula and one with the variables 
#  listed explicitly. These models are the same.
m1a = lm(mpg ~ ., data=df)
m1b = lm(mpg ~ hp + wt, data=df)

# Add column by squaring hp
df$hp2 = df$hp^2

# This model is now different from model m1a, because the formula 
#  (implicitly) includes the extra column hp2 that we just added to the data.
# Model m2b is just the explicit version of `mpg ~ .`
m2a = lm(mpg ~ ., data=df)
m2b = lm(mpg ~ hp + hp2 + wt, data=df)

# Here are two other ways to create the same model without using column hp2
m2c = lm(mpg ~ hp + I(hp^2) + wt, data=df)
m2d = lm(mpg ~ poly(hp, 2, raw=TRUE) + wt, data=df)

# Put all the model objects we just created into a list called ml
ml = list(m1a=m1a, m1b=m1b, m2a=m2a, m2b=m2b, m2c=m2c, m2d=m2d)

Now let's look at the output of each of these models:

Extract the model formula

This just shows what we already did. But keep in mind that we changed the data between model m1b and model m2a.

map(ml, ~.x$call)
#> $m1a
#> lm(formula = mpg ~ ., data = df)
#> 
#> $m1b
#> lm(formula = mpg ~ hp + wt, data = df)
#> 
#> $m2a
#> lm(formula = mpg ~ ., data = df)
#> 
#> $m2b
#> lm(formula = mpg ~ hp + hp2 + wt, data = df)
#> 
#> $m2c
#> lm(formula = mpg ~ hp + I(hp^2) + wt, data = df)
#> 
#> $m2d
#> lm(formula = mpg ~ poly(hp, 2, raw = TRUE) + wt, data = df)

Show the actual variables (data columns and/or transformations of data columns) that went into each model

map(ml, ~attr(terms(.x), "variables"))
#> $m1a
#> list(mpg, hp, wt)
#> 
#> $m1b
#> list(mpg, hp, wt)
#> 
#> $m2a
#> list(mpg, hp, wt, hp2)
#> 
#> $m2b
#> list(mpg, hp, hp2, wt)
#> 
#> $m2c
#> list(mpg, hp, I(hp^2), wt)
#> 
#> $m2d
#> list(mpg, poly(hp, 2, raw = TRUE), wt)

Show the fitted coefficients for each model.

Note that the m1 series models have identical coefficients, as do the m2 series models (though the order of the coefficients is different for model m2a).

map(ml, coef)
#> $m1a
#> (Intercept)          hp          wt 
#> 37.22727012 -0.03177295 -3.87783074 
#> 
#> $m1b
#> (Intercept)          hp          wt 
#> 37.22727012 -0.03177295 -3.87783074 
#> 
#> $m2a
#>   (Intercept)            hp            wt           hp2 
#> 41.0766024879 -0.1158391219 -3.0300128944  0.0002207219 
#> 
#> $m2b
#>   (Intercept)            hp           hp2            wt 
#> 41.0766024879 -0.1158391219  0.0002207219 -3.0300128944 
#> 
#> $m2c
#>   (Intercept)            hp       I(hp^2)            wt 
#> 41.0766024879 -0.1158391219  0.0002207219 -3.0300128944 
#> 
#> $m2d
#>              (Intercept) poly(hp, 2, raw = TRUE)1 poly(hp, 2, raw = TRUE)2 
#>            41.0766024879            -0.1158391219             0.0002207219 
#>                       wt 
#>            -3.0300128944

Created on 2020-10-18 by the reprex package (v0.3.0)


If you were to solve for the coefficients by hand on a piece of paper in your last data dump... would m1a/m1b and m2a/m2b require "different approaches" to fit those models?

One assumption I have is that data similar to m1a/m1b is fitted by solving for the coefficients of:

y = b0 + b1x1 + b2x2 + b3x3 + …

And data similar to m2a/m2b is fitted by solving for the coefficients of:

y = b0 + b1x1 + b1x1^2 + b1x1^3 + …

In other words... different strategies are required internally by lm to fit those models.

Is that an incorrect assumption?

How is it using one magical regression function for 2 completely different types of regressions, without explicitly telling it how to approach the data?

No, different strategies are not required. Each term has a separate coefficient, and you can see this in the examples above, where the m2 series of models has a different coefficient for hp and for hp^2. The function in your example should be rewritten as follows:

y = b_0 + b_1x_1 + b_2x_1^2 + b_3x_1^3 + …

The models are fit the same way in each case, but the mathematical function being fit is different.
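One way to see that, sketched below: lm essentially expands whatever formula you give it into a model matrix (one column per term) and then runs the same least-squares solve on that matrix, whatever those columns happen to contain.

# Same machinery, different model matrix
X1 = model.matrix(mpg ~ hp + wt, data = mtcars)            # first-order terms only
X2 = model.matrix(mpg ~ hp + I(hp^2) + wt, data = mtcars)  # adds a squared term
y  = mtcars$mpg

lm.fit(X1, y)$coefficients  # same as coef(m1b) above
lm.fit(X2, y)$coefficients  # same as coef(m2c) above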


Ahh... very interesting. I guess there's a lot I don't understand then; I assumed that internally it would compute these regressions using different strategies based on the type of regression. If it computes them the same way, then... well, I suppose I'll just have to accept that I don't understand and move on. I guess I'm used to reading about 50 tutorials where people solve regressions by hand using very specific strategies, with specific formulas that are NOT the same, and here comes lm, which can magically fit regressions by some internal method that doesn't need to be told how to work. As a programmer I tend to really dislike things that are magic, but I feel I have tugged on this string to try to understand the magic and it's just leading to 100 more questions, so I should probably accept defeat and move on...