I'm currently using R as part of a research project and need to do a "word count" on a list of words contained within PDF files.
As a new R user, I spent a week going through YouTube and Google tutorials and trying many different pieces of code. I thought I had finally cracked it, but some inconsistencies remain: the code has trouble picking up two-word terms (those with a space between them) in some of the PDF files. I assume this has something to do with the cleaning part of the code.
Can anyone please help?! The code I’m currently using is as follows:
The short answer is that the object created by TermDocumentMatrix is tokenized, that is, divided into single strings along spaces. So the object doesn't contain phrases.
library(tm)
#> Loading required package: NLP
data("acq")
all.tdm <- TermDocumentMatrix(acq, control = list(stopwords = TRUE,
                                                  tolower = TRUE,
                                                  stem = TRUE,
                                                  removeNumbers = TRUE,
                                                  bounds = list(global = c(1, Inf))))
inspect(all.tdm[c("buy"),])
#> <<TermDocumentMatrix (terms: 1, documents: 50)>>
#> Non-/sparse entries: 11/39
#> Sparsity : 78%
#> Maximal term length: 3
#> Weighting : term frequency (tf)
#> Sample :
#> Docs
#> Terms 10 135 153 157 186 331 366 372 387 408
#> buy 1 1 1 1 1 1 2 3 1 1
inspect(all.tdm[c("four seasons"),])
#> Error in `[.simple_triplet_matrix`(all.tdm, c("four seasons"), ): Subscript out of bounds.
# one of the headlines refers to "FOUR SEASONS"
# each word appears separately in the object
all.tdm$dimnames$Terms[696:700]
#> [1] "four" "fourth" "fraction" "frederik" "free"
all.tdm$dimnames$Terms[1450:1455]
#> [1] "scientific" "sealy" "sealy." "seasons" "sec"
#> [6] "second"
Phrases such as "Four Seasons" in the example need a different approach. I'll take a look and let you have some ideas. The hard work of extracting the words from the PDF files, however, is done.
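One way to keep two-word phrases inside tm itself is to supply a custom tokenizer that emits word pairs instead of single words. This is a sketch following the bigram-tokenizer pattern from the tm FAQ, reusing the acq data from above:

```r
library(tm)   # loading tm also attaches NLP, which provides words() and ngrams()
data("acq")

# Build two-word tokens ("bigrams") instead of single words.
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)

bigram.tdm <- TermDocumentMatrix(acq,
                                 control = list(tokenize = BigramTokenizer))

# Two-word terms now exist in the matrix and can be looked up directly:
"four seasons" %in% Terms(bigram.tdm)
```

The other control options (tolower, removeNumbers, and so on) can still be passed in the same control list alongside the custom tokenize entry.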
I think I understood the first part but not the second part of your answer. So what you're saying is that TermDocumentMatrix is what prevents my code from finding phrases with a "space" between them (aka 2+ words)?
As for the second half of your code, at the very bottom, what would that be? (all.tdm$dimnames$Terms[696:700])
TermDocumentMatrix is the result of splitting the text stream into "tokens", using spaces as separators. The snippet all.tdm$dimnames$Terms[696:700] shows the word "four" and the other shows the word "seasons". Both words are in the matrix, but they can only be fetched singly, not as the pair "four seasons".
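To make the splitting concrete, here is a small base-R illustration (the sentence is made up):

```r
s <- "Shares of Four Seasons rose sharply"

# Splitting on whitespace yields single-word tokens only:
tokens <- unlist(strsplit(tolower(s), "\\s+"))
tokens
#> [1] "shares"  "of"      "four"    "seasons" "rose"    "sharply"

# "four seasons" is not among them; it only appears if we
# rebuild adjacent pairs (bigrams) from the single tokens:
paste(head(tokens, -1), tail(tokens, -1))
#> [1] "shares of"    "of four"      "four seasons" "seasons rose"
#> [5] "rose sharply"
```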
Actually, maybe a more appropriate question would be: is TermDocumentMatrix necessary? It seems to be the reason why the code is unable to pick up those multi-word terms, because it's using spaces as separators. Is there another way I could write that "cleaning" part of the code (list(stopwords = TRUE, tolower = TRUE, etc.)) without TermDocumentMatrix?
In this case, not rows, just positions in the vector of words. To search for a phrase you need n-grams, as shown in the following snippet adapted from Text Mining with R.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
library(janeaustenr)
austen_trigrams <- austen_books() %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
austen_trigrams[which(austen_trigrams$trigram == "ten thousand pounds"),]
#> # A tibble: 7 × 2
#>   book                trigram
#>   <fct>               <chr>
#> 1 Sense & Sensibility ten thousand pounds
#> 2 Sense & Sensibility ten thousand pounds
#> 3 Pride & Prejudice   ten thousand pounds
#> 4 Pride & Prejudice   ten thousand pounds
#> 5 Pride & Prejudice   ten thousand pounds
#> 6 Pride & Prejudice   ten thousand pounds
#> 7 Persuasion          ten thousand pounds
I haven't looked in a while to know what the equivalent is in other packages, and I don't know what structure is desired for the object holding the results of the phrase searches, so I'll leave it at that.
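Tying this back to the original goal of counting a fixed list of one- and two-word terms in PDF files, here is a sketch using tidytext; the helper name is made up, and the PDF path in the usage comment is a placeholder. It assumes the pdftools package for text extraction (pdf_text() returns one character string per page):

```r
library(dplyr)
library(tidytext)

# Count how often each target term (one or two words) appears in a
# character vector of page text, e.g. the output of pdftools::pdf_text().
count_phrases <- function(pages, phrases) {
  d <- tibble(text = pages)
  bind_rows(
    unnest_tokens(d, term, text, token = "words"),         # single words
    unnest_tokens(d, term, text, token = "ngrams", n = 2)  # two-word phrases
  ) %>%
    filter(term %in% tolower(phrases)) %>%
    count(term)
}

# usage (hypothetical file and word list):
# count_phrases(pdftools::pdf_text("report.pdf"), c("four seasons", "buy"))
```

unnest_tokens lowercases by default, so the target list is lowercased too before filtering.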