I have a data frame containing a collection of tweets, and I want to find, for one of its columns, the number of matches against the words in a second data frame I'm using as a lookup table.
At the moment I've written a function (below) which I then lapply over, but it's very slow because it relies on a for loop.
word_count <- function(name) {
  word_sum <- 0
  # check each lookup word in turn for a (partial) match in the tweet text
  for (i in 1:nrow(lookup)) {
    value <- grepl(lookup$word[i], name)
    word_sum <- word_sum + value
  }
  # scale by the number of lookup words
  word_sum <- word_sum / nrow(lookup)
  return(word_sum)
}
Then I run lapply(tweets$name, word_count).
This is particularly slow (about 2 hours for ~20k tweets) and I'm sure there's a better way. I've looked into purrr::map but my brain can't quite compute it. Can anyone help?
Hi Chris, if you're actually analyzing tweets, you may want to consider using the tidytext package; here is the link to the book: https://www.tidytextmining.com/
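As a rough illustration of the kind of workflow the book covers (untested, and assuming your tweet text lives in tweets$name, your lookup frame has a word column, and each tweet has an id column), tokenizing and joining looks something like this:

library(dplyr)
library(tidytext)

# split each tweet into one word per row
tweet_words <- tweets %>%
  unnest_tokens(word, name)

# keep only the words that appear in the lookup table, then count per tweet
match_counts <- tweet_words %>%
  inner_join(lookup, by = "word") %>%  # exact (whole-word) matches only
  count(id, name = "matches")

Note that the join is a whole-word match, so it won't catch the partial matches a grepl approach gives you.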
Thanks for the link it has been a help. Whilst I can't use the method you have at the end (as I'm looking for a partial match rather than a full match) it does suggest some other avenues I'll look at.
I used the bioinformatics package Biostrings for a similar problem a while ago. Here is a short example using text lines from janeaustenr and a handful of lookup words:
require(Biostrings)
require(janeaustenr)

# create a BStringSet object from the book lines
tt <- austen_books()$text
subj <- BStringSet(tt)

# create another BStringSet from a curated set of words to locate
lookup <- BStringSet(c("read", "rabbit", "cousin", "polite", "civil"))

# count word instances on each line with vcountPDict, then summarize
word.counts <- colSums(vcountPDict(lookup, subj))
You'll have to install Biostrings through Bioconductor.
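If you've not installed anything from Bioconductor before, it goes roughly like this (BiocManager is the current installer):

# install BiocManager from CRAN if needed, then Biostrings from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Biostrings")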
It takes a few seconds to run, but it only matches a small set of words. The example above also doesn't recognize word boundaries in the subj lines; you'd have to add spaces to the strings in lookup to ensure that a word like "read" didn't match "readily".
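For example, one rough way to fake word boundaries is to pad both the lookup words and the lines with spaces (a sketch only; punctuation sitting right next to a word will still defeat it):

# pad the lookup words so "read" no longer matches "readily";
# pad the lines too so words at the very start or end of a line still match
lookup_padded <- BStringSet(paste0(" ", c("read", "rabbit", "cousin", "polite", "civil"), " "))
subj_padded <- BStringSet(paste0(" ", tt, " "))
word.counts <- colSums(vcountPDict(lookup_padded, subj_padded))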