Text extraction from PDF with search criteria

Rafn · June 18, 2020, 7:35am

I need to extract some text from a 400-page report, and I hope you can help.

So far I have been able to extract the fields that appear the same number of times in the PDF. Now I need to extract a piece of text that only appears sometimes and pair it with the other fields.
The standard setup in the PDF looks like this:

Schema element: Single word

Field type / facets: Multiple

lines

Properties: Single line

Guidance on completion of schema element: A lot of text

in multiple lines

Quality checks: This is the field that only appears sometimes, but it needs to be linked to the above fields if it appears.

So far my code looks like this:

library(pdftools)
library(pdfsearch)
library(tidyverse)
library(xlsx)

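# Schema element names: the first line after each "Schema element:" label
keywords <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
keywords <- keywords %>% strsplit("Schema element:")
keywords <- keywords %>% lapply(function(x) x[-1])  # drop the text before the first label on each page
# "\r?\n" matches both Windows and Unix line endings
keywords <- keywords %>% lapply(function(x) sapply(strsplit(x, "\r?\n"), `[`, 1))
keywords <- keywords %>% unlist()
keywords <- keywords %>% trimws()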


# Guidance text: everything after a guidance label, up to the next colon
text <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
text <- text %>% strsplit("Guidance on completion of schema element:|Guidance:|Guidance on completion of schema:")
text <- text %>% lapply(function(x) x[-1])
text <- text %>% lapply(function(x) sapply(strsplit(x, ":"), `[`, 1))
text <- text[lengths(text) > 0]  # drop pages without any guidance label
# Drop the last line of each chunk (the start of the next label) and join the rest
text <- text %>% lapply(function(x) sapply(strsplit(x, "\r?\n"),
                           function(y) paste(y[-length(y)], collapse = "")))
text <- text %>% unlist()
text <- text %>% {gsub("  ", " ", .)}
text <- text %>% trimws()
text <- text %>% sapply(`[`, 1)

# Quality checks: optional field, so some pages have no match at all
quality <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
quality <- quality %>% strsplit("Quality checks:|Quality check")
quality <- quality %>% lapply(function(x) x[-1])
quality <- quality %>% lapply(function(x) sapply(strsplit(x, "Schema element:"), `[`, 1))
quality <- quality %>% lapply(function(x) sapply(strsplit(x, "\r?\n"),
                                           function(y) paste(y[-length(y)], collapse = "")))
# Fill in the placeholder only after the last-line drop above, which would otherwise erase it
quality <- lapply(quality, function(x) if (length(x) == 0) {"empty"} else {x})
quality <- quality %>% unlist()
quality <- quality %>% {gsub("  ", " ", .)}
quality <- quality %>% trimws()
quality <- quality %>% sapply(`[`, 1)

df <- tibble(keywords, text, quality) # This doesn't work because quality is shorter than the rest.

I would like a df with the columns (Schema element, Field type, Properties, Guidance, quality) and the extracted text as values, with the word "empty" if a field doesn't exist in the PDF. Similar to:

> df
# A tibble: 570 x 2
   keywords                 text                                                                                         
   <chr>                    <chr>                                                                                        
 1 countryCode              Required. Two-letter ISO country code10.                                                     
 2 euRBDCode                Required (except in the RBDSUCA file). Unique EU code  of the River Basin District. Prefix t~
 3 created                  Optional. Date of creation of the dataset.                                                   
 4 creatorElectronicMailAd~ Required. E-mail address of the point of contact in the  organisation responsible for the da~
 5 creatorOrganisationName  Required. Name of the organisation doing the  reporting.                                     
 6 description              Optional. Description of the dataset.                                                        
 7 language                 Required. Code of the language of the dataset.                                               
 8 license                  Required. A legal document giving official permission  to do something with the resource. Pr~
 9 title                    Optional. Name given to the dataset.                                                         
10 rights                   Optional. Information about rights held in and over the  resource. If necessary, provide the~
# ... with 560 more rows

The PDF is linked below.

Hopefully you can help
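
One possible way to keep the optional field aligned with the others is to parse the report record by record instead of field by field: join the pages, split once on "Schema element:", and then look for each label inside its own record, writing "empty" when a label is missing. The following is only a minimal base R sketch of that idea. The helper names (`parse_record`, `parse_report`), the column names, and the sample text are made up for illustration, and it assumes the labels never occur inside the field text itself and that page headers or footers have already been removed.

```r
# Label patterns, taken from the variants used in the code above.
labels <- c(
  field_type = "Field type / facets:",
  properties = "Properties:",
  guidance   = "Guidance on completion of schema element:|Guidance on completion of schema:|Guidance:",
  quality    = "Quality checks?:?"
)

# Hypothetical helper: turn one record (the text following one
# "Schema element:" label) into a named character vector.
parse_record <- function(record, labels) {
  # The element name is assumed to be the first line of the record.
  element <- trimws(strsplit(record, "\r?\n")[[1]][1])
  any_label <- paste(labels, collapse = "|")
  fields <- vapply(labels, function(pattern) {
    m <- regexpr(pattern, record)
    if (m == -1) return("empty")  # label not present in this record
    rest <- substring(record, as.integer(m) + attr(m, "match.length"))
    nxt <- regexpr(any_label, rest)  # cut at the next label, if there is one
    if (nxt > 0) rest <- substring(rest, 1, as.integer(nxt) - 1)
    trimws(gsub("\\s+", " ", rest))  # collapse line breaks and repeated spaces
  }, character(1))
  c(schema_element = element, fields)
}

# Hypothetical helper: one row per schema element.
parse_report <- function(pages, labels) {
  # Join the pages first so a record that crosses a page break stays whole.
  full_text <- paste(pages, collapse = "\n")
  records <- strsplit(full_text, "Schema element:", fixed = TRUE)[[1]][-1]
  rows <- lapply(records, parse_record, labels = labels)
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# Made-up sample standing in for the output of pdf_text().
sample_pages <- c(
  "Intro text\nSchema element: countryCode\nField type / facets: CountryCode\nenumeration\nProperties: maxOccurs = 1\nGuidance on completion of schema element: Required. Two-letter\nISO country code.\nQuality checks: Element check\n",
  "Schema element: created\nField type / facets: DateType\nProperties: minOccurs = 0\nGuidance on completion of schema element: Optional. Date of creation\nof the dataset.\n"
)

df <- parse_report(sample_pages, labels)
print(df)
```

With the real report, `sample_pages` would be replaced by the result of `pdf_text()` on the file. Because every row is built from a single record, all columns have the same length by construction, so the result can be passed to `tibble::as_tibble()` without the size mismatch.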
