Topic Modelling: Preprocessing CSVs

Flocke · July 28, 2019, 8:31am

Hey there,

I'm currently struggling to prepare a set of TXT (comma-separated) files for topic modelling (LDA) on the corresponding data set.

Basically, I've got some TXT files with about 10 columns, where, let's say, column 4 contains written text.

My idea was to create a data frame from all my TXT files (>>100), then create a subset with just the text column and convert that into a Corpus, followed by some preprocessing steps. However, I get an error message at this point.

The code looks like this:

library(tm)
library(wordcloud)
workingDir <- "/my/dir/"
fileList <- list.files(path=workingDir, pattern=".txt")
fileList <- paste(workingDir, "//", fileList, sep="")
# create the corpus
dataList <- lapply(fileList, FUN=readLines)
dataList <- lapply(dataList, FUN=paste, collapse=" ")

#Create Corpus
amz_corpus <- Corpus(DataframeSource(dataList))

#Cleaning up the text

The error message is this one:

Error in DataframeSource(dataList) : 
  all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

andresrcs · July 28, 2019, 11:42am

Any chance you could share a link to at least a couple of those csv files so we can reproduce your code and try to help you?

Flocke · July 28, 2019, 12:44pm

Hey andresrcs,

thanks for your quick reply!

Sure, you can find a bunch of them over here:

andresrcs · July 28, 2019, 1:56pm

Your problem is with tm::DataframeSource(). According to the documentation, you have to pass a data frame with at least two columns named "doc_id" and "text", so you could do something like this (treating each .txt file as one "document" in the corpus):

library(tidyverse)
library(tm)

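# pattern is a regular expression, so escape the dot and anchor the end
list_of_files <- list.files(path = "data/", # Modify this as needed
                            pattern = "\\.txt$",
                            full.names = TRUE)

# Read each file line by line, tag every line with its file path (doc_id),
# then collapse the lines into a single string per document
dataList <- list_of_files %>% 
    setNames(nm = .) %>% 
    map_dfr(~tibble(text = readLines(.x)),
            .id = "doc_id") %>% 
    group_by(doc_id) %>% 
    summarise(text = paste(text, collapse = " "))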

amz_corpus <- Corpus(DataframeSource(dataList))
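
If you only want the written text (column 4 in your description) rather than every column of every line, a sketch along these lines might work. It assumes each file has a header row and parses cleanly with read.csv(); the temporary files, their column names, and the text_col variable are made up for illustration and are not taken from your data.

```r
# Create two small comma-separated files in a temp directory (made-up data)
tmp_dir <- file.path(tempdir(), "txt_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("id,date,user,review",
             "1,2019-07-01,a,Great product",
             "2,2019-07-02,b,Works as expected"),
           file.path(tmp_dir, "file1.txt"))
writeLines(c("id,date,user,review",
             "3,2019-07-03,c,Stopped working after a week"),
           file.path(tmp_dir, "file2.txt"))

text_col <- 4  # assumption: position of the free-text column

files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# One row per file: doc_id is the file name, text is the text column
# of that file collapsed into a single string
docs <- data.frame(
  doc_id = basename(files),
  text = vapply(files, function(f) {
    rows <- read.csv(f, stringsAsFactors = FALSE)
    paste(rows[[text_col]], collapse = " ")
  }, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Build the corpus only if tm is installed
if (requireNamespace("tm", quietly = TRUE)) {
  amz_corpus <- tm::Corpus(tm::DataframeSource(docs))
}
```

If your files have no header row, read.csv(f, header = FALSE) would be the variant to try; any further columns you keep in docs besides doc_id and text are treated by tm as document metadata.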

Flocke · July 28, 2019, 7:24pm

Hey andresrcs,

the hint about the col names did it for me.
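
A minimal sketch of that fix applied to the list from the first post, assuming one collapsed string per file. The paths and strings below are hypothetical stand-ins, not the real data.

```r
# Stand-ins for the result of lapply(fileList, readLines) followed by
# lapply(dataList, paste, collapse = " "): one string per file
fileList <- c("/my/dir/a.txt", "/my/dir/b.txt")      # hypothetical paths
dataList <- list("text of file a", "text of file b") # hypothetical content

# DataframeSource() wants a data frame with doc_id and text columns,
# not a plain list
docs <- data.frame(
  doc_id = basename(fileList),
  text = unlist(dataList),
  stringsAsFactors = FALSE
)

# Build the corpus only if tm is installed
if (requireNamespace("tm", quietly = TRUE)) {
  amz_corpus <- tm::Corpus(tm::DataframeSource(docs))
}
```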

Thank you so much for spending your time helping me over here, big karma!
