I need to extract some text from a 400-page report and I hope you can help.
So far I have been able to extract the pieces of text that appear the same number of times in the PDF. Now I have to extract a piece of text that only appears sometimes and pair it with the other text.
The standard setup in the PDF looks like this:
Schema element: Single word
Field type / facets: Multiple
lines
Properties: Single line
Guidance on completion of schema element: A lot of text
in multiple lines
Quality checks: This is the field that only appears sometimes, but it needs to be linked to the fields above when it does appear.
So far my code looks like this:
library(pdftools)
library(pdfsearch)
library(tidyverse)
library(xlsx)
keywords <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
keywords <- keywords %>% strsplit("Schema element:")
keywords <- keywords %>% lapply(function(x) x[-1])
keywords <- keywords %>% lapply(function(x) sapply(strsplit(x, "\r\n"), `[`, 1))
keywords <- keywords %>% unlist
keywords <- keywords %>% trimws()
text <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
text <- text %>% strsplit("Guidance on completion of schema element:|Guidance:|Guidance on completion of schema:")
text <- text %>% lapply(function(x) x[-1])
text <- text %>% lapply(function(x) sapply(strsplit(x, ":"), `[`, 1))
text <- text[lapply(text,length)>0]
text <- text %>% lapply(function(x) sapply(strsplit(x, "\r\n"),
                                           function(y) paste(y[-length(y)], collapse = "")))
text <- text %>% unlist()
text <- text %>% {gsub(" +", " ", .)} # collapse runs of spaces into one
text <- text %>% trimws()
text <- text %>% sapply(`[`, 1)
quality <- pdf_text("FINAL Draft4_WFD_Reporting_Guidance_2022_resource_page_JERAF.pdf")
quality <- quality %>% strsplit("Quality checks:|Quality check")
quality <- quality %>% lapply(function(x) x[-1])
quality <- quality %>% lapply(function(x) sapply(strsplit(x, "Schema element:"), `[`, 1))
quality <- lapply(quality, function(x) if (length(x) == 0) {"empty"} else {x}) # placeholder for pages without a quality check
quality <- quality %>% lapply(function(x) sapply(strsplit(x, "\r\n"),
                                                 function(y) paste(y[-length(y)], collapse = "")))
quality <- quality %>% unlist()
quality <- quality %>% {gsub(" +", " ", .)} # collapse runs of spaces into one
quality <- quality %>% trimws()
quality <- quality %>% sapply(`[`, 1)
df <- tibble(keywords, text, quality) # This doesn't work because quality is shorter than the rest.
I would like a df with the columns (Schema element, Field type, Properties, Guidance, quality) and with the text as values, using the word "empty" if a field doesn't exist in the PDF. Similar to:
> df
# A tibble: 570 x 2
keywords text
<chr> <chr>
1 countryCode Required. Two-letter ISO country code10.
2 euRBDCode Required (except in the RBDSUCA file). Unique EU code of the River Basin District. Prefix t~
3 created Optional. Date of creation of the dataset.
4 creatorElectronicMailAd~ Required. E-mail address of the point of contact in the organisation responsible for the da~
5 creatorOrganisationName Required. Name of the organisation doing the reporting.
6 description Optional. Description of the dataset.
7 language Required. Code of the language of the dataset.
8 license Required. A legal document giving official permission to do something with the resource. Pr~
9 title Optional. Name given to the dataset.
10 rights Optional. Information about rights held in and over the resource. If necessary, provide the~
# ... with 560 more rows
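One way to keep the optional field aligned with the others is to split the text into one chunk per "Schema element:" record first, and only then look for each label inside that chunk, so a missing "Quality checks" simply becomes "empty" for that row. The following is a minimal base R sketch under assumptions: `full_text` is a made-up two-record sample standing in for `paste(pdf_text(...), collapse = "\n")`, the helper `extract_field` and the column names are hypothetical, and the label patterns are guesses based on the layout described above, so they may need adjusting for the real report.

```r
# Made-up sample; in practice this would be the collapsed output of pdf_text()
full_text <- paste(
  "Schema element: countryCode",
  "Field type / facets: CountryCode_Enum",
  "Properties: maxOccurs = 1",
  "Guidance on completion of schema element: Required. Two-letter",
  "ISO country code.",
  "Quality checks: Must match the code list.",
  "Schema element: created",
  "Field type / facets: DateType",
  "Properties: minOccurs = 0",
  "Guidance on completion of schema element: Optional. Date of creation.",
  sep = "\n"
)

# One chunk per record; drop whatever precedes the first label, then restore the label
pieces  <- strsplit(full_text, "Schema element:", fixed = TRUE)[[1]][-1]
records <- paste0("Schema element:", pieces)

# Label patterns (assumed); any of them also marks the end of the previous field
fields <- c(
  schema_element = "Schema element:",
  field_type     = "Field type / facets:",
  properties     = "Properties:",
  guidance       = "Guidance[^:\n]*:",
  quality        = "Quality checks?:?"
)
any_label <- paste(fields, collapse = "|")

# Hypothetical helper: text between one label and the next label (or end of record)
extract_field <- function(label, record) {
  pattern <- paste0("(?s)(?:", label, ")(.*?)(?=", any_label, "|$)")
  m <- regmatches(record, regexec(pattern, record, perl = TRUE))[[1]]
  if (length(m) < 2) return("empty")          # label not present in this record
  value <- trimws(gsub("\\s+", " ", m[2]))    # join wrapped lines, tidy whitespace
  if (nzchar(value)) value else "empty"
}

# One named character vector per record, stacked into a data frame
rows <- lapply(records, function(rec) {
  vapply(fields, extract_field, character(1), record = rec)
})
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(df)
```

Because every row is built from a single record, all columns have the same length by construction, which avoids the length mismatch in the `tibble(keywords, text, quality)` call. Label words that also occur inside the guidance text itself would cut a field short, so the patterns may need tightening against the actual PDF.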
The PDF is linked below.
Hopefully you can help.