I have a data frame with 392 rows, with mpg as the dependent variable and 7 candidate independent variables. I'm not using the last variable, 'name', in my model because it is a factor.
Here is the top of the table:
mpg cylinders displacement horsepower weight acceleration year origin
1 18 8 307 130 3504 12.0 70 1
2 15 8 350 165 3693 11.5 70 1
3 18 8 318 150 3436 11.0 70 1
4 16 8 304 150 3433 12.0 70 1
5 17 8 302 140 3449 10.5 70 1
6 15 8 429 198 4341 10.0 70 1
name
1 chevrolet chevelle malibu
2 buick skylark 320
3 plymouth satellite
4 amc rebel sst
5 ford torino
6 ford galaxie 500
Here is the structure of the data frame:
data.frame: 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..:
The variable origin is coded as 1 = American, 2 = German, 3 = Japanese.
I left this variable as an integer for the lm model.
I ran lm and reduced the model to the significant variables:
Call:
lm(formula = auto.mpg$mpg ~ auto.mpg$displacement + auto.mpg$horsepower +
auto.mpg$weight + auto.mpg$year + auto.mpg$origin)
Residuals:
Min 1Q Median 3Q Max
-9.4882 -2.1157 -0.1645 1.8650 13.0544
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.669e+01 4.120e+00 -4.051 6.16e-05 ***
auto.mpg$displacement 1.137e-02 5.536e-03 2.054 0.0406 *
auto.mpg$horsepower -2.192e-02 1.078e-02 -2.033 0.0428 *
auto.mpg$weight -6.324e-03 5.685e-04 -11.124 < 2e-16 ***
auto.mpg$year 7.484e-01 5.089e-02 14.707 < 2e-16 ***
auto.mpg$origin 1.385e+00 2.772e-01 4.998 8.80e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All of these variables are either numeric or integer.
The only factor variable is "name" and that is not used in the lm model.
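One way to double-check this claim is to inspect the columns that actually entered the fitted model. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for auto.mpg, and the names fit, col_classes and any_factor are made up for the example.

```r
# Stand-in data: built-in mtcars, NOT the auto.mpg data from the question.
# 'fit', 'col_classes' and 'any_factor' are hypothetical names.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Class of every column that was used to fit the model
col_classes <- sapply(model.frame(fit), class)
print(col_classes)

# TRUE if any column used by the model is a factor
any_factor <- any(sapply(model.frame(fit), is.factor))
print(any_factor)
```

With the real data, the same two calls on the fitted model object should confirm whether a factor slipped into the model.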
Once I have the lm model, I plot the residuals, which look fine (random scatter and roughly equal variance).
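For reference, a residual plot of this kind can be produced with base R roughly as follows. This is a sketch under assumptions: mtcars stands in for auto.mpg, and fit is a hypothetical name for the fitted model.

```r
# Stand-in data: built-in mtcars, NOT the auto.mpg data from the question.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Residuals against fitted values; a patternless band around zero
# suggests roughly constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
```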
When I try to add the best-fit line, I get this warning:
Warning message:
In abline(auto.mpg.linear, col = "red") :
only using the first two of 6 regression coefficients
My research suggests this message is caused by a factor variable in the linear model,
but none of my regression inputs is a factor.
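The message can apparently be reproduced without any factor at all. When abline() is given a fitted model with more than two coefficients, it draws a line from only the intercept and the first slope and warns about the rest. The sketch below uses the built-in mtcars data as a stand-in for auto.mpg; fit and msg are hypothetical names. For a residual plot, a horizontal reference line at zero is one common alternative, shown in the last line.

```r
# Stand-in data: built-in mtcars, NOT the auto.mpg data from the question.
# All predictors are numeric, yet the model has 4 coefficients.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
print(length(coef(fit)))

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")

# Passing the model to abline() triggers the warning; capture its text
msg <- tryCatch(
  {
    abline(fit, col = "red")
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# Possible alternative for a residual plot: horizontal line at zero
abline(h = 0, col = "red")
```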
Any suggestions on how to get this to work?
FYI: I'm new to R programming and am trying to make this work with base R functions.