Hello, I am very new to R and I am looking at how to convert a PDF to Excel. Is there any example code I can try?
This is a tougher task than it might seem, since PDF encoding is very complicated and the content can't always be extracted with the same spatial relations we perceive. For instance, copy-pasting from a PDF table often yields garbage.
Here's a blog post walking through one way:
If you've converted the data to an image (e.g. using ImageMagick), you could then perform OCR with Tesseract:
https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
Thank you, I have managed to convert the PDF into images and output a CSV. How would I go about formatting this CSV, e.g. splitting the text on spaces into separate cells?
Here is my code so far:
library(tesseract)
library(pdftools)
# Render each PDF page to a TIFF image (pdf_convert returns one file path per page)
img_file <- pdftools::pdf_convert("filepath/test.pdf", format = 'tiff', dpi = 400)
# Extract text from the TIFF image(s)
text <- ocr(img_file)
writeLines(text, "filepath/mydata.csv")
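One possible way to get the OCR text into cells is to split each line on whitespace before writing the file, since writeLines() only dumps the raw text and adds no separators. The following is a minimal sketch in base R, not a definitive solution: the helper name ocr_text_to_table, its min_spaces argument, and the sample text are all made up for illustration, and it assumes every cell is free of internal spaces (a cell like "New York" would be split in two).

```r
# Hypothetical helper (not part of tesseract or pdftools):
# split OCR text into a data frame, one row per line, one column per token.
ocr_text_to_table <- function(text, min_spaces = 1) {
  # ocr() returns one string per page; break every page into lines
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  lines <- lines[nzchar(lines)]            # drop empty lines
  if (length(lines) == 0) return(data.frame())

  # Split each line on runs of at least `min_spaces` spaces or tabs
  pattern <- sprintf("[ \t]{%d,}", min_spaces)
  cells <- strsplit(lines, pattern)

  # Pad short rows with NA so all rows have the same number of columns
  n_col <- max(lengths(cells))
  padded <- lapply(cells, function(x) c(x, rep(NA_character_, n_col - length(x))))
  as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

# Made-up sample standing in for the `text` returned by ocr()
sample_text <- "Name Qty Price\nApple 3 1.20\nPear 10 0.85\n"
tbl <- ocr_text_to_table(sample_text)

# Write a real comma-separated file; NA cells are written as empty
out_file <- file.path(tempdir(), "mydata.csv")
write.table(tbl, out_file, sep = ",", na = "",
            row.names = FALSE, col.names = FALSE)
```

With real OCR output you would call ocr_text_to_table(text) on the result of ocr(img_file). Expect to clean up rows by hand where the OCR misread characters or where a cell contained spaces.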