This first part basically shows what I did to run a LOGIT model and construct a confusion matrix. I was going to provide an image of my results, however new users are limited to 1 photo. In summary I had 26 true positives,11 false negatives,2 false positives and 6 true negatives.
library(caret)
library(MASS)
library(pROC)
# create training and testing data
set.seed(123)
trainIndex = createDataPartition(BCGsub$Status, p = .8,
list = FALSE,
times = 1)
BCGsubTrain = BCGsub[ trainIndex,]
BCGsubTest = BCGsub[-trainIndex,]
##summary(BCGsubTest)
# Fit LOGIT model and choose best model using stepAIC
glm.model=glm(Status~.,data= BCGsubTrain,family="binomial")
summary(glm.model)
glm.fit= stepAIC(glm.model, direction = 'backward')
glm.fit
# Make predictions on test data and construct a confusion matrix
probs.glm = predict(glm.fit,newdata = BCGsubTest ,type = "response")
#range(probs)
pred.glm = rep("NoRelapse", length(probs.glm))
pred.glm[probs > 0.5] = "Relapse"
table(pred.glm,BCGsubTest$Status)
Using what I have done I tried to implement my work into the next question. The next question asked to first normalize the training and testing data. Apply a 5-fold cross validation method on the training set using trainControl() with “CV” method in the caret package.Perform lda, qda and knn classification methods using train() in caret package And to make the predictions on the test data set, While also creating the confusion matrix.
I got reference material online to normalize my data from
https://machinelearningmastery.com/pre-process-your-dataset-in-r/
I need help debugging the code.
If I run the predict() using "raw" I get the same exact results from my LOGIT confusion matrix for all three models, which is very fishy and if I run "probs" I don't get the right number of predictions.
set.seed(123)
trainIndex = createDataPartition(BCGsub$Status, p = .8,
list = FALSE,
times = 1)
BCGsubTrain = BCGsub[ trainIndex,]
BCGsubTest = BCGsub[-trainIndex,]
# normalize the training and testing data
## calculate the pre-process parameters from the dataset
preprocessParams1 = preProcess(BCGsubTrain, method=c("range"))
preprocessParams2 = preProcess(BCGsubTest, method=c("range"))
## transform the dataset using the parameters
BCGsubTrainNorm = predict(preprocessParams1, BCGsubTrain)
BCGsubTestNorm = predict(preprocessParams2, BCGsubTest)
## summarize the transformed dataset
###summary(BCGsubTrainNorm)
###summary(BCGsubTestNorm)
# run a 5-fold Cross-Validation
set.seed(123)
train.control = trainControl(method = "cv", number = 5)
# perform LDA, QDA,KNN
lda.model = train(Status~.,data= BCGsubTrainNorm, method = "lda",
trControl = train.control)
qda.model = train(Status~.,data= BCGsubTrainNorm, method = "qda",
trControl = train.control)
knn.model = train(Status~.,data= BCGsubTrainNorm, method = "knn",
trControl = train.control)
# Make predictions on test data and construct a confusion matrix
probs.lda = predict(lda.model,newdata = BCGsubTestNorm ,type = "response")
#range(probs.lda)
pred.lda = rep("NoRelapse", length(probs.lda))
pred.lda[probs > 0.5] = "Relapse"
table(pred.lda,BCGsubTestNorm$Status) # hw more correct
probs.qda = predict(qda.model,newdata = BCGsubTestNorm ,type = "response")
#range(probs.qda)
pred.qda = rep("NoRelapse", length(probs.qda))
pred.qda[probs > 0.5] = "Relapse"
table(pred.qda,BCGsubTestNorm$Status) # hw more correct
probs.knn = predict(knn.model,newdata = BCGsubTestNorm ,type = "response")
#range(probs.knn)
pred.knn = rep("NoRelapse", length(probs.knn))
pred.knn[probs > 0.5] = "Relapse"
table(pred.knn,BCGsubTestNorm$Status) # hw more correct
Output: