Hello,
I am trying to efficiently export data that is locked inside PDF files. I know that there are a few ways of doing this. I am starting off by learning how to use the Tabulizer package.
As an example, here is the PDF that I am using:
During my initial exploration, I tried exporting the table in page 16:
url <- "https://d1io3yog0oux5.cloudfront.net/_2747cc8535dcb78c0cc79115c3968b21/skx/db/437/3006/annual_report/2017_ANNUAL_REPORT.pdf"
test_pg_16 <- extract_tables(url, pages = 16, guess = TRUE, output = "data.frame")
> test_pg_16
[[1]]
X X2016 X.1 X2017 X.2 X2017.1 X.3
1 Domestic stores NA NA NA NA NA
2 Concept ............................................................................... NA 117 NA 5 NA (5)
3 Factory Outlet ..................................................................... NA 163 NA 7 NA —
4 Warehouse Outlet ............................................................... NA 133 NA 29 NA —
5 Domestic stores total .......................................................... NA 413 NA 41 NA (5)
6 International stores NA NA NA NA NA
7 Concept ............................................................................... NA 101 NA 19 NA —
8 Factory Outlet ..................................................................... NA 51 NA 16 NA —
9 Warehouse Outlet ............................................................... NA 5 NA 4 NA —
10 International stores total ..................................................... NA 157 NA 39 NA —
The output is obviously quite messy, but I think I will be able to find some way to clean up most of it via some data wrangling in dplyr.
My key question: what are my options if the data is embedded inside the PDF as an image?
For example, page 89 of this annual report looks like this:
test_pg_89 <- extract_tables(url, pages = 89, guess = TRUE, output = "data.frame")
# this comes up empty
pg_89_text_only <- extract_text(url, page = 89)
# this also comes up empty
I realize that this may be a fairly complex problem. I am guessing one would have to extract the images first, and then find a way to read the text on those images via some sort of character recognition process. But I am working on a project that involves a lot of PDF files and a large number of them have stored the data as images embedded inside the PDF.
Any thoughts or pointers to resources would be very helpful. Thanks in advance.