I downloaded the CSV from Kaggle. I first used Naive Bayes and then tried decision trees to compare the two methods, but I am having trouble with rpart. I've included all the code up to the point where I get the error, in case that helps. I am new to coding, so I'm not sure whether it does. I've tried searching for
#Check the structure of the data frame. The label is an integer but should be a factor
#so that identical labels get grouped together later on.
str(digitInfo)
#Changing label to type factor
digitInfo$label<-as.factor(digitInfo$label)
str(digitInfo)
#Create training data and testing data, then load the e1071 package.
Sample<- as.integer(nrow(digitInfo)/3)
Sample1<-sample(nrow(digitInfo),Sample)
(DigTest<-digitInfo[Sample1,])
(DigTrain<-digitInfo[-Sample1,])
#Make sure label is factor type
str(DigTest)
#Copy the labels
(TestLabels <- DigTest[,1])
str(DigTest)
#Remove the labels
(DigTestNOLabel <- DigTest[,-c(1)])
library(e1071)
#The average number of times 28x21 was used by the numbers 7 & 9
#were 0.45 & 0.08 respectively & their standard deviations were 7.43 for 7 and 2.03 for 9.
(NBe1071<-naiveBayes(DigTrain[,-1], DigTrain$label, laplace = 1))
NBe1071Pred <- predict(NBe1071, DigTestNOLabel)
NBe1071
table(NBe1071Pred,TestLabels)
(NBe1071Pred)
#Visualize
plot(NBe1071Pred)
#Decision Tree
library(rpart)
#Got an error when trying to run rpart: "variable lengths differ". When I searched for the error, the results
#said to make sure there are no NAs in the data and then to check the data types. Everything looks okay.
sum(is.na(digitInfo))
sum(is.na(DigTrain))
str(DigTrain)
str(digitInfo)
fitTrain<- rpart(digitInfo$label ~ . , data = DigTrain, method="class")
Error in model.frame.default(formula = digitInfo$label ~ ., data = DigTrain, :
variable lengths differ (found for 'X1x1')
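
One possible reading of the error, sketched on made-up data rather than the Kaggle file: the response `digitInfo$label` is taken from the full data frame, while the predictors come from `DigTrain`, which holds only about two thirds of the rows, so the two lengths cannot match. The sketch below assumes that is the cause. `toyDigits`, `toyTrain`, `toyTest`, and the two pixel columns are hypothetical stand-ins, not the real data.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the digit data: a factor label plus two pixel columns
toyDigits <- data.frame(
  label = factor(sample(0:2, 90, replace = TRUE)),
  X1x1  = runif(90),
  X1x2  = runif(90)
)

# Same split logic as in the question: one third for testing, the rest for training
testRows <- sample(nrow(toyDigits), as.integer(nrow(toyDigits) / 3))
toyTrain <- toyDigits[-testRows, ]
toyTest  <- toyDigits[testRows, ]

# The response in the failing call has the length of the full frame,
# while the predictors have the length of the training subset
length(toyDigits$label)  # 90
nrow(toyTrain)           # 60

# Reproduce the failing pattern and capture the error message instead of stopping
errMsg <- tryCatch(
  rpart(toyDigits$label ~ ., data = toyTrain, method = "class"),
  error = function(e) conditionMessage(e)
)
print(errMsg)

# Naming the column alone lets rpart look it up inside `data`
fitToy  <- rpart(label ~ ., data = toyTrain, method = "class")
predToy <- predict(fitToy, toyTest, type = "class")
table(predToy, toyTest$label)
```

If the same reasoning applies to the real data, the corresponding call would be `rpart(label ~ ., data = DigTrain, method = "class")`, though I have not confirmed that this resolves it on the full data set.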