Hello sir,
I have been having trouble with the predict function underestimating (or overestimating) the predictions for the category of new text (its class, e.g. sport, health, politics).
First I import a TDM (term-document matrix) of my corpus, then
split my data into training and test sets to use in my model with the KNN algorithm,
and that works fine.
Now I need to import new text of unknown category and predict its category, but I do not know how to do that.
I don't know how to use the predict function.
library(class)   # KNN model
# Stemming words
library(gmodels) # CrossTable
# Read CSV with columns: Document, Terms and Category
PathFile <- read.csv(file.choose(), sep = ";", header = TRUE)
PathFilenameUnk <- read.csv(file.choose(), sep = ",", header = TRUE)
# Structure of CSV file
# Split data by row number into 70% training and 30% test
train <- sample(nrow(PathFile), ceiling(nrow(PathFile) * .70))
test <- (1:nrow(PathFile))[- train]
##Show Training Data
##Show Test Data
# Isolate classifier
cl <- PathFile[, "Category"]
# Create model data and remove "category"
modeldata <- PathFile[,!colnames(PathFile) %in% "Category"]
# Create model: training set, test set, training set classifier
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train], 70)
# Confusion matrix
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
CrossTable(x = cl[test], y = knn.pred, prop.chisq=FALSE)
predict(knn.pred,PathFilenameUnk) ### error here!!!!
# Accuracy
(accuracy <- sum(diag(conf.mat))/length(test) * 100)
# Create data frame with test data and predicted category
df.pred <- cbind(knn.pred, modeldata[test, ])
write.table(df.pred, file="output.csv", sep=";")
And here is my CSV file:
I know I have to do the same steps for the unknown text and import it as a DTM matrix,
but I am doing something wrong!
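For reference, here is a minimal sketch of what I think is needed, not a tested solution for my real files. As far as I understand, knn() from the class package returns the predicted labels directly (a factor) rather than a model object, so there is nothing for predict() to work on; the unknown rows would instead be passed as the test argument. The toy data frames known and unknown, their terms, and the variable names below are hypothetical stand-ins for my two CSV files, and the sketch assumes the unknown file holds the same kind of term counts as the training file.

```r
library(class)  # provides knn()

# Hypothetical toy data standing in for the two CSV files
known <- data.frame(
  goal     = c(3, 2, 0, 0, 4, 0),
  doctor   = c(0, 0, 3, 2, 0, 4),
  Category = factor(c("sport", "sport", "health", "health", "sport", "health"))
)
unknown <- data.frame(doctor = c(0, 5), vaccine = c(0, 1), goal = c(2, 0))

# Term columns the training data is built on
terms <- setdiff(colnames(known), "Category")

# Align the unknown matrix to the training terms:
# terms missing from the new text get a count of 0
for (term in setdiff(terms, colnames(unknown))) {
  unknown[[term]] <- 0
}
# Drop extra terms and put the columns in the same order as the training data
unknown <- unknown[, terms, drop = FALSE]

# knn() has no separate fit step: the unknown rows go in as the "test" argument
unk.pred <- knn(train = known[, terms], test = unknown,
                cl = known$Category, k = 3)
print(unk.pred)
```

With the real files, known would be PathFile and unknown would be PathFilenameUnk, and k would be whatever value is used for the model above.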
Thanks in advance.
TDM_2018_05_09_225323.csv: this is the original file I use with this script.
Predict_TDM_2018_05_09_225025.csv: this is the file I need to predict. How do I use it with the predict function?
Here are my files.