I am trying to fit a linear regression to predict the sale price of the cars in the imports85 dataset (shipped with the randomForest package). My code is as follows:
library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
data("imports85")
db <- imports85
View(db)
db <- db[, -(1:2)]  # drop the first two columns (symboling, normalizedLosses)
set.seed(0)
library(fastDummies)
library(naniar)
vis_miss(db)
db <- na.omit(db)
vis_miss(db)
db2 <- dummy_cols(db,
                  select_columns = c("make", "fuelType", "aspiration", "numOfDoors",
                                     "bodyStyle", "driveWheels", "engineLocation",
                                     "engineType", "numOfCylinders", "fuelSystem"),
                  remove_first_dummy = TRUE,
                  remove_selected_columns = TRUE)
ind <- sample(2, nrow(db2), replace = TRUE, prob = c(0.5, 0.5))
train2 <- db2[ind==1,]
test2 <- db2[ind==2,]
model <- lm(price ~ ., data = train2)
summary(model)
classPred2 <- predict(object = model, test2)
classPred2
My first question is about model <- lm(price ~ ., data = train2). Since the data frame has many columns, I cannot write all of them on the right side of the ~. With this notation, am I using price to predict price itself? Should I somehow exclude it from the right-hand side?
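To make this first question concrete, here is a toy sketch of the pattern I am asking about (made-up data and column names, not the real imports85 columns):

```r
# hypothetical two-column data frame, just to illustrate the formula
toy <- data.frame(price      = c(10, 20, 30, 40, 50),
                  horsepower = c(100, 150, 200, 250, 300))

# Does the '.' expand to "every column, including price",
# or to "every column except the response (price)"?
m <- lm(price ~ ., data = toy)
```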
My second question is about classPred2 <- predict(object = model, test2). I don't understand how the prediction works here: I am passing test2, which still includes the price column, the very variable I am trying to predict. Should I remove that column before calling predict?
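To illustrate this second question with a self-contained toy example (again made-up data, not the real imports85 columns):

```r
toy_train <- data.frame(price      = c(10, 20, 30, 40, 50),
                        horsepower = c(100, 150, 200, 250, 300))
m <- lm(price ~ ., data = toy_train)

# toy_test still contains a price column, just like my test2 does.
# Does predict() look at that column, or does it only read the
# predictor columns named in the model's formula?
toy_test <- data.frame(price = 25, horsepower = 170)
predict(m, newdata = toy_test)
```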
Any answer is appreciated.
Best regards.