I'm currently using R as part of a research project and need to do a "word count" on a list of words contained within PDF files.
As a new R user, I spent a week going through YouTube and Google tutorials and trying many different pieces of code. I thought I had finally cracked it, but some inconsistencies remain: the code has trouble picking up two-word terms (those with a space between them) in some of the PDF files. I assume this has something to do with the cleaning part of the code.
Can anyone please help?! The code I’m currently using is as follows:
The short answer is that the object created by TermDocumentMatrix is tokenized, that is, divided into single strings along spaces. So the object doesn't contain phrases.
library(tm)
#> Loading required package: NLP
data("acq")
all.tdm <- TermDocumentMatrix(acq, control = list(stopwords = TRUE,
                                                  tolower = TRUE,
                                                  stem = TRUE,
                                                  removeNumbers = TRUE,
                                                  bounds = list(global = c(1, Inf))))
inspect(all.tdm[c("buy"),])
#> <<TermDocumentMatrix (terms: 1, documents: 50)>>
#> Non-/sparse entries: 11/39
#> Sparsity : 78%
#> Maximal term length: 3
#> Weighting : term frequency (tf)
#> Sample :
#> Docs
#> Terms 10 135 153 157 186 331 366 372 387 408
#> buy 1 1 1 1 1 1 2 3 1 1
inspect(all.tdm[c("four seasons"),])
#> Error in `[.simple_triplet_matrix`(all.tdm, c("four seasons"), ): Subscript out of bounds.
# one of the headlines refers to "FOUR SEASONS"
# each word appears separately in the object
all.tdm$dimnames$Terms[696:700]
#> [1] "four" "fourth" "fraction" "frederik" "free"
all.tdm$dimnames$Terms[1450:1455]
#> [1] "scientific" "sealy" "sealy." "seasons" "sec"
#> [6] "second"
Phrases such as "Four Seasons" in the example need a different approach. I'll take a look and let you have some ideas. The hard work of extracting the words from the PDF files, however, is done.
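One way to keep two-word phrases inside tm itself is to supply a custom tokenizer that emits word pairs instead of single words. This is a sketch following the bigram-tokenizer pattern from the tm FAQ, reusing the acq data from above:

```r
library(tm)   # loading tm also attaches NLP, which provides words() and ngrams()
data("acq")

# Build two-word tokens ("bigrams") instead of single words.
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)

bigram.tdm <- TermDocumentMatrix(acq,
                                 control = list(tokenize = BigramTokenizer))

# Two-word terms now exist in the matrix and can be looked up directly:
"four seasons" %in% Terms(bigram.tdm)
```

The other control options (tolower, removeNumbers, and so on) can still be passed in the same control list alongside the custom tokenize entry.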
I think I understood the first part but not the second part of your answer. So what you're saying is that TermDocumentMatrix is what prevents my code from finding phrases with a "space" between them (aka 2+ words)?
As for the second half of your code, at the very bottom, what would that be? (all.tdm$dimnames$Terms[696:700])
TermDocumentMatrix is the result of splitting the text stream into "tokens", using spaces as separators. The snippet all.tdm$dimnames$Terms[696:700] shows the word "four" and the other shows the word "seasons". Both words are in the matrix, but they can only be fetched singly, not as the pair "four seasons".
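To make the splitting concrete, here is a small base-R illustration (the sentence is made up):

```r
s <- "Shares of Four Seasons rose sharply"

# Splitting on whitespace yields single-word tokens only:
tokens <- unlist(strsplit(tolower(s), "\\s+"))
tokens
#> [1] "shares"  "of"      "four"    "seasons" "rose"    "sharply"

# "four seasons" is not among them; it only appears if we
# rebuild adjacent pairs (bigrams) from the single tokens:
paste(head(tokens, -1), tail(tokens, -1))
#> [1] "shares of"    "of four"      "four seasons" "seasons rose"
#> [5] "rose sharply"
```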
Actually, maybe a more appropriate question would be: is TermDocumentMatrix necessary? It seems to be the reason why the code is unable to pick up those multi-word terms, because it's using spaces as separators. Is there another way I could write that "cleaning" part of the code (list(stopwords = TRUE, tolower = TRUE, etc.)) without TermDocumentMatrix?
In this case, not rows, just positions in the vector of words. To search for a phrase you need n-grams, as shown in the following snippet adapted from Text Mining with R.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
library(janeaustenr)
austen_trigrams <- austen_books() %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
austen_trigrams[which(austen_trigrams$trigram == "ten thousand pounds"),]
#> # A tibble: 7 × 2
#>   book                trigram
#>   <fct>               <chr>
#> 1 Sense & Sensibility ten thousand pounds
#> 2 Sense & Sensibility ten thousand pounds
#> 3 Pride & Prejudice   ten thousand pounds
#> 4 Pride & Prejudice   ten thousand pounds
#> 5 Pride & Prejudice   ten thousand pounds
#> 6 Pride & Prejudice   ten thousand pounds
#> 7 Persuasion          ten thousand pounds
I haven't looked in a while to know what the equivalent is in other packages, and I don't know what structure is desired for the object holding the results of the phrase searches, so I'll leave it at that.
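Tying this back to the original goal of counting a fixed list of one- and two-word terms in PDF files, here is a sketch using tidytext; the helper name is made up, and the PDF path in the usage comment is a placeholder. It assumes the pdftools package for text extraction (pdf_text() returns one character string per page):

```r
library(dplyr)
library(tidytext)

# Count how often each target term (one or two words) appears in a
# character vector of page text, e.g. the output of pdftools::pdf_text().
count_phrases <- function(pages, phrases) {
  d <- tibble(text = pages)
  bind_rows(
    unnest_tokens(d, term, text, token = "words"),         # single words
    unnest_tokens(d, term, text, token = "ngrams", n = 2)  # two-word phrases
  ) %>%
    filter(term %in% tolower(phrases)) %>%
    count(term)
}

# usage (hypothetical file and word list):
# count_phrases(pdftools::pdf_text("report.pdf"), c("four seasons", "buy"))
```

unnest_tokens lowercases by default, so the target list is lowercased too before filtering.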