Scraping text and tables from a pdf systematically

thryos · October 19, 2023, 12:58pm

Dear community,

I'm trying to scrape some specific text and tables from a metadata pdf report using the pdftools package. The pdf_text() function is quite easy to handle, but I struggle with cleaning the output in order to have a dataframe that only contains the information I want.

To facilitate reproducibility, I created a fake pdf copying the format of my real dataset. Unfortunately, I can't upload it since I'm a new user, but here is the first page:

The challenge is that:

I only want the information in bold
Some variables have different metadata depending on whether they are categorical (tables of values) or numerical (format, minimum, maximum)
Some tables are split between two pages (e.g. the last category of the last table is in the next page)

Although that would be ideal, I do not wish to have the fully coded version answering each problem, but rather bits of answers, leads that would help me overcome these issues. Here is the reprex of this example as imported on R with the pdf_text() function :

library(pdftools)
library(tidyverse)
library(reprex)

pdf_ex <- pdf_text("D:/VARIABLE.pdf")
pdf_ex <- str_split(pdf_ex, "\n", simplify = TRUE)

dput(pdf_ex)
#> structure(c("VARIABLE_NAME1", "03                I don’t know", 
#> "", "", "Description of the variable", "", "", "", "Input variable = ANOTHER_VARIABLE_NAME", 
#> "", "", "     DOCUMENTATION PDF – Table", "List of codes", 
#> "", "", "", "                                           N/A or Missing", 
#> "", " 01                                        First category", 
#> "", " 02                                        Second category", 
#> "", " 03                                        Third category", 
#> "", "", "", "", "", "VARIABLE_NAME2", "", "", "", "Description of the variable", 
#> "", "", "", "Input variable = ANOTHER_VARIABLE_NAME2", "", "", 
#> "", "List of codes", "", "", "", "                                           N/A or Missing", 
#> "", " 01                                        Category 1", 
#> "", " 02                                        Category 2", 
#> "", " 03                                        Category 3", 
#> "", " 04                                        Category 4", 
#> "", " 05                                        Category 5", 
#> "", " 06                                        Category 6", 
#> "", "", "", "", "", "VARIABLE_NAME3", "", "", "", "Description of the variable", 
#> "", "", "", "Input variable = ANOTHER_VARIABLE_NAME3", "", "", 
#> "", "Some random stuff", "", "", "", "Numerical variable", "", 
#> "", "", "Format: Integer", "", "", "", "Minimum : XX", "", "", 
#> "", "Maximum : XXX", "", "", "", "", "", "", "", "VARIABLE_NAME4", 
#> "", "", "", "Description of the variable", "", "", "", "Input variable = ANOTHER_VARIABLE_NAME4", 
#> "", "", "", "                                           N/A or Missing", 
#> "", " 01                                        Yes", "", " 02                                        No", 
#> "", "", "", "", "", "", "", "                              DOCUMENTATION PDF – Table", 
#> "", "", ""), dim = c(2L, 63L))

Please note that I have already the list of variables I want to scrape contained in an object list called variable_name.
Could anyone help me find a scraping strategy based on recurrent patterns happening on the documentation & the variable names I have in my object? I was thinking a matching command that matches any element of my variable_name object with the pdf text.

Thank you all for your help and I hope I presented everything correctly.

thryos · October 19, 2023, 1:08pm

Here is the second page of the example pdf:

system · November 30, 2023, 1:08pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.