Convert a PDF to Excel or CSV

NewbieJono · March 18, 2022, 6:27pm

Hello, I am very new to R and am looking at how to convert a PDF to Excel. Is there any example code I can try?

jonspring · March 18, 2022, 6:57pm

This is a tougher task than it might seem, since the PDF format is very complicated and its content can't always be extracted with the same spatial relationships we perceive on the page. For instance, copy-pasting from a PDF table often yields garbage.

Here's a blog post walking through one way:

If you've converted the data to an image (e.g. using ImageMagick), you could then perform OCR with Tesseract:

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html

NewbieJono · March 18, 2022, 9:04pm

Thank you. I have managed to convert the PDF into images and output a CSV. How would I go about formatting this CSV, e.g. splitting the text on spaces into separate cells?

Here's my code so far:

library(tesseract)
library(pdftools)

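# Render the PDF pages to TIFF images (one file per page)
img_file <- pdftools::pdf_convert("filepath/test.pdf", format = "tiff", dpi = 400)

# Extract text from the images with OCR (one string per page)
text <- ocr(img_file)

# Write the raw OCR text to a file (plain text, not yet split into columns)
writeLines(text, "filepath/mydata.csv")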
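
One possible way to split the OCR text into cells is sketched below. It is only a rough starting point and makes several assumptions that are not from this thread: that each table row ends up on its own line, that columns are separated by two or more spaces, and that the first row is the header. The sample `text` string and the `out_file` path are made up so the code runs without a PDF. OCR output is often messier than this, so the split pattern will likely need adjusting (for example `" +"` if columns are separated by single spaces).

```r
# Hypothetical stand-in for the result of ocr(img_file): one string per page,
# with table rows separated by newlines.
text <- c("Name   Qty   Price\nApple   3   1.20\nPear   10   0.85")

# Break the pages into individual lines and drop empty ones.
lines <- unlist(strsplit(text, "\n", fixed = TRUE))
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

# Split each line into cells on runs of two or more spaces.
# Assumption: gaps between columns are wider than gaps between words.
cells <- strsplit(lines, " {2,}")

# Pad short rows with NA so every row has the same number of columns.
n_cols <- max(lengths(cells))
padded <- lapply(cells, function(x) c(x, rep(NA_character_, n_cols - length(x))))
tbl <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

# Assumption: the first row holds the column names.
names(tbl) <- unlist(tbl[1, ])
tbl <- tbl[-1, , drop = FALSE]

# Write a real comma-separated file (hypothetical output path).
out_file <- file.path(tempdir(), "mydata.csv")
write.csv(tbl, out_file, row.names = FALSE)
```

With real data, replace the sample `text` with the output of `ocr(img_file)` and inspect `tbl` before writing it out, since misread characters or merged columns are common with OCR.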
