Extract multiple tables from a PDF that are not in the same format

sunil_451 · December 18, 2018, 10:24am

Hi,

I have a PDF file that contains multiple tables, and the tables are not in the same format. The PDF is generated from an XML file, and I want to compare the values in the PDF against the values in the source XML.

I have tried many approaches with the tabulizer package, but none of them resolved my issue. Could anyone please help me with this?

Thanks & Regards,

Sunil

EconomiCurtis · December 18, 2018, 10:34am

A good way to get help with this sort of thing is to pose an example question as a reprex (or as close to a reproducible example as possible). See the FAQ: Tips for writing R-related questions.

It's not totally clear to me what problem you're having, but there are a number of R packages that may help. These are all aimed at getting your PDF and XML data into something you can work with in R.

pdftools, for extracting text, fonts, attachments and metadata from a PDF file

xml2 is a handy tidyverse package for working with HTML and XML from R
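As a rough sketch of how the two packages might be combined, the snippet below reads the text lines of a PDF and the text of some XML nodes, then lists the XML values that never appear in the PDF text. The file names `report.pdf` and `source.xml`, the XPath `//Customer_id`, and the helper names `read_pdf_lines` and `read_xml_values` are placeholders made up for illustration, not anything taken from your files.

```r
# Hypothetical file names; replace them with your own.
pdf_path <- "report.pdf"
xml_path <- "source.xml"

# pdf_text() returns one character string per page;
# splitting on newlines gives the individual text lines.
read_pdf_lines <- function(path) {
  pages <- pdftools::pdf_text(path)
  trimws(unlist(strsplit(pages, "\n", fixed = TRUE)))
}

# Return the text of every node matching an XPath expression.
read_xml_values <- function(path, xpath) {
  doc <- xml2::read_xml(path)
  xml2::xml_text(xml2::xml_find_all(doc, xpath))
}

# Only runs when both (assumed) files actually exist.
if (file.exists(pdf_path) && file.exists(xml_path)) {
  pdf_lines <- read_pdf_lines(pdf_path)
  ids <- read_xml_values(xml_path, "//Customer_id")  # assumed element name

  # TRUE for each XML value that occurs somewhere in the PDF text.
  found <- vapply(
    ids,
    function(v) any(grepl(v, pdf_lines, fixed = TRUE)),
    logical(1)
  )
  not_found <- ids[!found]
  print(not_found)
}
```

This is a plain substring check, so it only tells you whether a value shows up somewhere in the PDF, not whether it sits in the right table or column.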

sunil_451 · December 18, 2018, 11:11am

The PDF file is a report, and it is generated from an XML file, so I want to compare the XML values with the PDF values, like a source-to-target comparison. Is there any way to load multiple tables with different structures from a PDF into CSV format?

I am able to convert the XML file into CSV, so if I can also convert the PDF into CSV, then we can compare the two CSV files using R.

Example:

Say one PDF contains 4 tables:
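The comparison step itself could look roughly like the sketch below, which uses only base R. The sample data is made up and stands in for the two CSV files (in practice each data frame would come from `read.csv()`), `compare_tables` is a hypothetical helper name, and the sketch assumes both tables have the same columns and share a key column.

```r
# Made-up sample data standing in for the two CSV files.
xml_df <- data.frame(Customer_id = c(1, 2, 3),
                     customer_name = c("Ann", "Bob", "Cy"),
                     stringsAsFactors = FALSE)
pdf_df <- data.frame(Customer_id = c(1, 2, 4),
                     customer_name = c("Ann", "Rob", "Di"),
                     stringsAsFactors = FALSE)

# Full outer join on the key, then one ".match" flag per value column.
compare_tables <- function(source, target, key) {
  both <- merge(source, target, by = key, all = TRUE,
                suffixes = c(".xml", ".pdf"))
  value_cols <- setdiff(names(source), key)
  for (col in value_cols) {
    s <- both[[paste0(col, ".xml")]]
    t <- both[[paste0(col, ".pdf")]]
    # A value matches only if it is present on both sides and equal.
    both[[paste0(col, ".match")]] <- !is.na(s) & !is.na(t) & s == t
  }
  both
}

result <- compare_tables(xml_df, pdf_df, key = "Customer_id")
print(result)
```

Rows that exist on only one side are kept by `all = TRUE` and end up flagged as non-matching, so missing records and changed values can be filtered from the same result.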

Customers

Customer_id customer_name customer_product customer_dept

Product

Product_id product_name product_cost

Dept

Dept_id dept_name dept_loc

Country

Country_id country_name country_address

The tables above are all in a single PDF. My scenario is that I want to load all 4 tables into one CSV file.

Note: the attributes of the 4 tables are different, and the tables do not share the same structure.
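One possible way to fit tables with different columns into a single CSV is a "long" layout with one row per cell: table name, row number, field name, value. The sketch below uses only base R. The sample rows are made up, only two of the four tables are shown, and `to_long` and `all_tables.csv` are hypothetical names chosen for illustration.

```r
# Made-up rows for two of the four tables from the example.
tables <- list(
  Customers = data.frame(Customer_id = 1:2,
                         customer_name = c("Ann", "Bob"),
                         stringsAsFactors = FALSE),
  Product = data.frame(Product_id = 10L,
                       product_name = "Widget",
                       product_cost = 9.5,
                       stringsAsFactors = FALSE)
)

# Turn one table into rows of (table, row, field, value).
# Values are stored as text so that all tables fit one column.
to_long <- function(name, df) {
  n <- nrow(df)
  data.frame(
    table = rep(name, n * ncol(df)),
    row   = rep(seq_len(n), times = ncol(df)),
    field = rep(names(df), each = n),
    value = unlist(lapply(df, as.character), use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Stack the long form of every table into one data frame.
long <- do.call(
  rbind,
  lapply(names(tables), function(nm) to_long(nm, tables[[nm]]))
)
rownames(long) <- NULL

# Written to a temporary directory here; use your own path in practice.
out_file <- file.path(tempdir(), "all_tables.csv")
write.csv(long, out_file, row.names = FALSE)
```

If the XML-derived tables are put through the same reshaping, the two long CSV files have identical columns and can be compared row by row, without needing the four tables to share a structure.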
