Simple word frequency in one variable

Slavek · October 8, 2019, 2:15pm

Hi,
I have gone through texts like this one: https://www.tidytextmining.com/tfidf.html#term-frequency-in-jane-austens-novels
but what I need is simpler: a way to list all English words that occur in one string variable, together with their counts, excluding specified words (like "the") and words shorter than 3 characters.
Let's use this simple sample:

data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii"),
  E1 = c(1, 2, 3, NA, NA, NA, NA, NA, NA),
  string = c("book", "my book", "example", "examples", "nothing",
             "the end", "a", "v", "bg"),
  A2 = c(10, 2, 3, 4, 5, 6, 7, 8, 9),
  B1 = c(3, 9, 10, 1, 2, NA, 9, 6, 7),
  D1 = c(-1, 10, 6, -1, 8, 9, 7, -1, 99)
)

I don't know what the output should look like, but I simply need a list like:
book 2
example or examples 2
nothing 1

etc...

Can you help?

mara · October 8, 2019, 2:25pm

Common words, such as "the", are called stop words, and a list of them ships with the tidytext package as the stop_words dataset (see section 1.3 of the tidytext book you linked above for an introduction).

library(tidytext)
data(stop_words)
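
As a rough sketch of how the stop word list might be applied: the string column is first split into one word per row with unnest_tokens(), and the stop words are then dropped with dplyr's anti_join(). The data frame df below is a hypothetical three-row excerpt of the sample in the question, and the object name words is only illustrative.

```r
library(dplyr)
library(tidytext)

# Hypothetical three-row excerpt of the sample data from the question
df <- data.frame(
  URN = c("bbb", "fff", "ggg"),
  string = c("my book", "the end", "a"),
  stringsAsFactors = FALSE
)

words <- df %>%
  unnest_tokens(word, string) %>%      # one lowercased word per row
  anti_join(stop_words, by = "word")   # keep only rows whose word is not a stop word

words
```

stop_words combines several lexicons (identified by its lexicon column), so it can be broader than expected and may drop ordinary-looking words too. It is worth inspecting the result, or filtering stop_words down to a single lexicon first.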

To get the length of a string (or word) you can use stringr::str_length(). You could combine this with dplyr's filter() to remove all words shorter than three characters.
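
Putting the pieces together, a minimal sketch of the whole count might look like this. It assumes the exclusions are given as a hand-written vector rather than the full stop_words list; the names excluded and word_counts are made up for illustration, and only the string column of the sample data is reproduced.

```r
library(dplyr)
library(stringr)
library(tidytext)

# The string column from the sample data in the question
df <- data.frame(
  string = c("book", "my book", "example", "examples", "nothing",
             "the end", "a", "v", "bg"),
  stringsAsFactors = FALSE
)

# Hypothetical list of words to exclude; extend as needed
excluded <- c("the", "my")

word_counts <- df %>%
  unnest_tokens(word, string) %>%       # one lowercased word per row
  filter(!word %in% excluded) %>%       # drop the specified words
  filter(str_length(word) >= 3) %>%     # drop words shorter than 3 characters
  count(word, sort = TRUE)              # frequency per word, most frequent first

word_counts
```

Here "example" and "examples" are still counted separately. Merging them would need an extra stemming step, which this sketch does not attempt.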
