I'm currently struggling to prepare a set of comma-separated TXT files for topic modelling (LDA).
Each file has about 10 columns, and one of them (say column 4) contains written text.
My idea was to read all of my TXT files (well over 100) into one data frame, take a subset with just the text column, convert that into a Corpus, and then run some preprocessing steps. However, I get an error message at the corpus step.
Code looks like this:
library(tm)
library(wordcloud)
workingDir <- "/my/dir/"
fileList <- list.files(path=workingDir, pattern=".txt")
fileList <- paste(workingDir, "//", fileList, sep="")
# create the corpus
dataList <- lapply(fileList, FUN=readLines)
dataList <- lapply(dataList, FUN=paste, collapse=" ")
#Create Corpus
amz_corpus <- Corpus(DataframeSource(dataList))
#Cleaning up the text
The error message is:
Error in DataframeSource(dataList) :
all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
The problem is the call to tm::DataframeSource(). According to the documentation, you have to pass a data frame with at least two columns named "doc_id" and "text", whereas dataList is a plain list without those names. One way to fix it is to build such a data frame with one row per .txt file, treating each file as a "document" of the corpus.
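A minimal sketch of that idea follows. It assumes each .txt file should become one document, uses the file name as doc_id, and keeps the placeholder directory from the question. The helper name build_docs is hypothetical (it is not part of tm), and the corpus step is guarded so the sketch does nothing if tm is missing or no files are found.

```r
# Hypothetical helper (not part of tm): builds a data frame with the
# "doc_id" and "text" columns that tm::DataframeSource() checks for.
build_docs <- function(dir) {
  # full.names = TRUE returns complete paths, so no manual paste() is needed;
  # "\\.txt$" matches only names that end in a literal ".txt"
  files <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)

  # Read every file and collapse its lines into a single string
  texts <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )

  data.frame(
    doc_id = basename(files),  # file name as document id (assumption)
    text = texts,
    stringsAsFactors = FALSE
  )
}

# "/my/dir/" is the placeholder path from the question
docs <- build_docs("/my/dir/")

# Build the corpus only if there is something to read and tm is installed
if (nrow(docs) > 0 && requireNamespace("tm", quietly = TRUE)) {
  amz_corpus <- tm::Corpus(tm::DataframeSource(docs))
}
```

Any further columns in docs would be treated as document metadata, so extra information about each file could be added there.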