I am working in R with stock prices (S&P 500) and text documents. I downloaded financial news and preprocessed it with the tm package: I created a corpus, cleaned the documents, and transformed them into a document-term matrix (dtm), a term-document matrix (tdm), and a tidy dtm via the tidytext package, so all three representations are available.

I now have the data I need to build a model that predicts the stock price from this news. I am quite new to this field and have read that I first need to split the data into a training set and a test set. How can I do that with my dtm data? Should I split the stock price data in the same way?

I would also like to use a support vector machine (SVM), because much of the literature on news-based prediction uses this method. How can I build an SVM on top of the code below and run the prediction to get the accuracy?

I have been trying for weeks and could not find much code online for this specific topic. PS: I tried to split the data (last lines of the code below) and it ran, but is it correct?
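# Load the packages used below (tm for the corpus, tidytext for tidy())
library(tm)
library(tidytext)
# first cleaning procedure done before this code
# Create Corpus from local news data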
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("en"))
# Replace every match of `pattern` with a single space (extra spaces are stripped later)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "\\b[a-z]\\b")
docs <- tm_map(docs, toSpace, "\\byevkhtm\\b")
docs <- tm_map(docs, toSpace, "\\bcfr\\b")
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
# Create Document Term Matrix
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Create Term Document Matrix
tdm <- TermDocumentMatrix(docs)
tdm <- removeSparseTerms(tdm, 0.99)
tdm <- as.matrix(tdm)
# DTM Tidy
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Data Split
set.seed(123)  # arbitrary seed so the random split is reproducible
id_train <- sample(nrow(dtm), floor(nrow(dtm) * 0.70))
dtm.train <- dtm[id_train, ]
dtm.test <- dtm[-id_train, ]
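To make the question concrete, here is a minimal sketch of how the split could be kept aligned with the price data and then fed to an SVM. Everything in it is an assumption for illustration: `toy_dtm` is a small random matrix standing in for the real `dtm`, `y` is a made-up label vector (one label per document, for example whether the index went up or down after the filing), and the model uses `svm()` from the e1071 package, which is only one of several SVM implementations in R.

```r
# Toy stand-in for the document-term matrix: 20 documents x 5 terms.
# In the real workflow this would be the `dtm` matrix built above.
set.seed(42)
toy_dtm <- matrix(rpois(20 * 5, lambda = 2), nrow = 20,
                  dimnames = list(paste0("doc", 1:20), paste0("term", 1:5)))

# Hypothetical labels, one per document (placeholders, not real S&P 500 data).
y <- factor(rep(c("down", "up"), times = 10), levels = c("down", "up"))
names(y) <- rownames(toy_dtm)

# Draw the training indices once and reuse them for features AND labels,
# so that row i of the matrix stays paired with label i.
n <- nrow(toy_dtm)
id_train <- sample(n, floor(0.70 * n))

x_train <- toy_dtm[id_train, , drop = FALSE]
x_test  <- toy_dtm[-id_train, , drop = FALSE]
y_train <- y[id_train]
y_test  <- y[-id_train]

# Sanity check: documents and labels are still aligned after the split.
stopifnot(identical(rownames(x_train), names(y_train)),
          identical(rownames(x_test), names(y_test)))

# Fit a linear SVM and measure accuracy on the held-out documents.
# Skipped if e1071 is not installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit  <- e1071::svm(x = x_train, y = y_train, kernel = "linear", scale = FALSE)
  pred <- predict(fit, x_test)
  accuracy <- mean(as.character(pred) == as.character(y_test))
  print(table(predicted = pred, actual = y_test))
  print(accuracy)  # meaningless here because the toy labels are arbitrary
} else {
  message("Package e1071 is not installed; skipping the SVM step.")
}
```

The key point of the sketch is that the same `id_train` indexes both the matrix and the labels. Because stock data is ordered in time, a random split lets the model train on documents that postdate some test documents; a chronological split (train on the earliest 70%, test on the rest) is a common alternative.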
I really appreciate any help and would be more than thankful!