Import/convert PDF files for quanteda

I would like to analyse PDF files in RStudio with the quanteda package.

I tried several ways of converting PDF files to the .rda format, but none of the procedures I followed worked, and they all seemed quite intricate. Does anyone know how such a conversion can be performed?

For PDFs that contain actual text (i.e. not scans) I have had success with {pdftools}. For scans that need OCR I was told that {tesseract} is a good choice, but I have not tried it personally.

This is a sample of my workflow when using the package:


library(pdftools) # provides pdf_text()
library(stringr)  # provides str_replace_all()

asdf <- pdf_text("path-to-yer-document.pdf") # read the file in / as a character vector, one element per page
res <- "" # global init

for (i in seq_along(asdf)) {
  res <- paste0(res, asdf[i]) # paste individual pages together
}

res <- str_replace_all(res, "\n", " ") # replace newlines with spaces
res <- str_replace_all(res, "\\s+", " ") # replace multiple spaces with a single one

print(res) # look what the cat has brought in!
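
If you prefer to avoid the loop, the same clean-up can be sketched in base R, with an optional hand-off to quanteda at the end. This is only a rough sketch: `pages` is a made-up stand-in for what pdf_text() returns, `doc_text` and `corp` are names I picked for illustration, and the quanteda step only runs if the package is installed.

```r
# Hypothetical stand-in for the character vector returned by pdf_text()
# (one element per page); swap in your own pdf_text() result.
pages <- c("First page,\nline two.  ", "Second   page.\n")

# Collapse all pages into a single string without an explicit loop
doc_text <- paste(pages, collapse = " ")

# Replace every run of whitespace (newlines included) with one space,
# then drop leading and trailing whitespace
doc_text <- trimws(gsub("\\s+", " ", doc_text))

print(doc_text)

# Optional: build a one-document quanteda corpus from the cleaned text
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp <- quanteda::corpus(doc_text)
  print(quanteda::ndoc(corp)) # number of documents in the corpus
}
```

Passing `collapse` to paste() does the page-joining in one call; from the corpus you can carry on with the usual quanteda steps such as tokens().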

It works! Thank you.

Glad to be of service! Text mining is exciting stuff...

