I have a corpus named "Mow_corp_lite" with 203k elements, 812.5 MB in size. I am trying to tokenize the corpus into bigrams and then summarize the bigrams in a wordcloud.
Take a look at the quanteda package. It will do bi-grams, tri-grams, and proximity grams, and it makes it easy to convert corpus objects from tm or tidytext.
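For instance, something along these lines should get you from a tm corpus to a bigram document-feature matrix (just a sketch; tm_corp is a placeholder for your corpus, and quanteda's corpus() can usually convert a tm corpus directly):

library(quanteda)
# convert the tm corpus (placeholder name) to a quanteda corpus
q_corp <- corpus(tm_corp)
# tokenize, drop punctuation and stopwords, then form bigrams
toks <- tokens(q_corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
bi_toks <- tokens_ngrams(toks, n = 2)
# the dfm stays sparse, so it is far gentler on RAM than a dense matrix
bi_dfm <- dfm(bi_toks)
topfeatures(bi_dfm, 20)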
Thanks very much. I will start playing with that package later today.
I'm reasonably sure the script above is not working due to a problem I created by trying to compress a corpus prior to tokenizing and converting to a matrix.
The motivation for compressing the corpus was to reduce the object size to something my machine can handle in RAM when creating the bigram matrix.
An alternate approach, I think, would be to make the dataset more sparse.
Just to confirm the obvious: you've taken out stopwords, of course?
Mow_cleanest is nowhere near as large as Moby Dick, for example, so unless you're struggling with 4 GB of RAM, memory shouldn't be a problem.
I'm not sure what package you're working in, but you might also take a look at tidytext (https://github.com/dgrtwo/tidy-text-mining) and work the examples in Chapter 4.
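Chapter 4 of that book does bigrams with unnest_tokens(); roughly like this (a sketch, where text_df and its text column are placeholder names for a one-row-per-document data frame):

library(dplyr)
library(tidytext)
# token = "ngrams" with n = 2 splits each document into overlapping bigrams
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)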
Thank you. I'm using tm and qdap. I'm following the process laid out by Ted Kwartler in his very helpful DataCamp text-mining course.
I have a vector of words 1.1 MB in size. When unlisted and split, it has 65k entries.
The as.matrix function was throwing errors about not being able to process a 4.2 GB file. I have at most 7.87 GB usable.
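One possible way around that (a sketch; dtm is a placeholder for the document-term matrix that as.matrix choked on): slam, which tm already builds on, can sum term frequencies straight from the sparse matrix, so nothing 4 GB-sized ever gets materialized.

library(slam)
# col_sums() works on tm's sparse simple_triplet_matrix; no call to as.matrix() needed
term_freqs <- sort(col_sums(dtm), decreasing = TRUE)
head(term_freqs, 20)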
I sorted the words in descending order to create a custom stoplist, and then removed all but the top 2000 (in terms of frequency). I named that character vector "Mow_trimmed":
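(Something along these lines, just as a sketch; word_vec stands in for the unlisted token vector:)

# rank every token by frequency, then keep only the 2000 most frequent
word_freqs  <- sort(table(word_vec), decreasing = TRUE)
Mow_trimmed <- names(word_freqs)[1:2000]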
I just went back to the tm manual; it doesn't have bi-grams, but qdap::ngrams does. It doesn't take a corpus as an argument; it wants a text object.
ngrams(text.var, grouping.var = NULL, n = 2, ...)
This is my first look at qdap and there are other functions that will take either a text.var object or a word frequency matrix, but they say so explicitly.
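So a call would presumably look something like this (a sketch; docs stands in for a plain character vector of documents):

library(qdap)
# ngrams() wants a text vector rather than a corpus; n = 2 gives bigrams
bi <- ngrams(docs, n = 2)
str(bi, max.level = 1)  # inspect the structure qdap returns before going further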
The method I've been trying to follow uses the RWeka package. Sorry for being unclear about that earlier!
# tm provides DocumentTermMatrix(); RWeka provides NGramTokenizer()
library(tm)
library(RWeka)

# Make tokenizer function that emits bigrams only (min = 2, max = 2)
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create bigram_dtm from the cleaned corpus
bigram_dtm <- DocumentTermMatrix(text_corp, control = list(tokenize = tokenizer))
I'm sure the method works, and maybe challenging myself to figure out what I've overlooked with that method is part of learning to troubleshoot similar hangups in the future.
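Once bigram_dtm from the snippet above is built, the last step toward the original wordcloud goal might look like this (a sketch; it keeps the matrix sparse so as.matrix() is never needed):

library(slam)
library(wordcloud)
# bigram frequencies straight from the sparse DTM, then the wordcloud itself
bigram_freqs <- sort(col_sums(bigram_dtm), decreasing = TRUE)
wordcloud(words = names(bigram_freqs), freq = bigram_freqs, max.words = 100)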