Extracting blocks of PDF text and combining them into a data.frame using regex

Hi all,

I hope someone can help me with this task.

I need to extract data contained in specific sections from a series of PDF files.
This is the first time I have worked with PDF text. After reading up on library(pdftools) and regex (which I had never heard of before) and checking various examples, I am going in circles: I don't yet have enough knowledge to adapt those examples to what I need.

These are some examples of PDF documents that I need to use.

I tried to download them, but they appear with no text, so I opted to save them in the working directory instead and have attached the files to this message.

Leucofeligen.pdf (154.2 KB)
Pentofel.pdf (207.9 KB)

What I tried to do in GetData is extract the text following each of the headings of interest. So far I have detected two issues (although there may be more):

  1. The headings are repeated in some cases, so multiple entries are generated, while I only want one.
  2. I don't know how to specify where the text I am interested in ends. I have reused patterns from other regex examples, but they don't work properly.

The ultimate goal is to generate a data frame with the columns labelled in the GetData function (Name, Composition, target_species, etc.).

I hope the message is clear enough to get some tips...
Many thanks in advance
Beatriz

This is the code I used in my session, with the output I generated shown under Out. Sorry, the reprex is underneath, but it is not working for some reason.

library(pdftools)
library(purrr)   # for map() and map_dfr()
library(stringr) # for str_extract() and str_trim()

#download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90","Pentofelb.pdf")

#download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90", "Leucofeligen.pdf")

file.list <- list.files(pattern = "\\.(pdf|PDF)$")
x <- map(file.list, ~ pdf_text(.))
names(x) <- gsub("\\.pdf$", "", file.list)


GetData <- function(x){
  # For each heading, the lookbehind (?<=...) starts the match right after the
  # heading text, and [^-]+ then captures everything up to the next hyphen.
  list(Name  = str_trim(str_extract(x, "(?<=NAME OF THE VETERINARY MEDICINAL PRODUCT)[^-]+")),
       Compositione = str_trim(str_extract(x, "(?<=QUALITATIVE AND QUANTITATIVE COMPOSITION)[^-]+")),
       Target_species = str_trim(str_extract(x, "(?<=Target species)[^-]+")),
       Indications = str_trim(str_extract(x, "(?<=Indications for use, specifying the target species)[^-]+")),
       Contraindications = str_trim(str_extract(x, "(?<=Contraindications)[^-]+")))
}

Out <- map_dfr(x, GetData)

Out

The reprex output is underneath, but it is not working; I am not sure why it is not finding the source documents in my working directory.

#Example

#Download the PDF files.

#download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90","Pentofelb.pdf")

#download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90", "Leucofeligen.pdf")

library(pdftools)
#> Warning: package 'pdftools' was built under R version 4.2.3
#> Using poppler version 22.04.0



file.list <- list.files(pattern = "\\.(pdf|PDF)$")
x <- map(file.list, ~ pdf_text(.))
#> Error in map(file.list, ~pdf_text(.)): could not find function "map"
names(x) <- gsub("\\.pdf", "", file.list)
#> Error in names(x) <- gsub("\\.pdf", "", file.list): object 'x' not found


GetData <- function(x){
  list(Name  = str_trim(str_extract(x, "(?<=NAME OF THE VETERINARY MEDICINAL PRODUCT)[^-]+")),
       Compositione = str_trim(str_extract(x, "(?<=QUALITATIVE AND QUANTITATIVE COMPOSITION)[^-]+")),
       Target_species = str_trim(str_extract(x, "(?<=Target species)[^-]+")),
       Indications = str_trim(str_extract(x, "(?<=Indications for use, specifying the target species)[^-]+")),
       Contraindications = str_trim(str_extract(x, "(?<=Contraindications)[^-]+")))
} 

Out <- map_dfr(x, GetData)
#> Error in map_dfr(x, GetData): could not find function "map_dfr"

Out
#> Error in eval(expr, envir, enclos): object 'Out' not found

Created on 2023-07-06 with reprex v2.0.2

Reading this kind of data from a PDF is not an easy task; many formatting problems can get in the way. One question is whether you need something "good enough", which can be verified by a human, or something very good, which will not make mistakes without human supervision.

For a high-quality parser, you'd probably need a lot more code, reading each line one by one and making sure that the title numbering is consistent etc.
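
For illustration, one such numbering check might look like this (a sketch only, assuming the content_by_line vector of lines that we build below):

section_numbers <- str_match(content_by_line, "^([0-9]+)\\. +[A-Z ]+$")[, 2]
section_numbers <- as.integer(section_numbers[!is.na(section_numbers)])
# within a single annex, the top-level sections should be numbered 1, 2, 3, ...
stopifnot(all(diff(section_numbers) == 1))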

If you just need something "mostly good", then using regex is a good approach. The regexes you used rely on lookahead/lookbehind, which is a rather advanced use of regex; personally I find them confusing and try to avoid them, so I will only use simple regexes here.
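
To illustrate the difference, here is a toy comparison on a made-up snippet (the heading and the extracted sentence are invented, not taken from the real documents):

library(stringr)

x <- "3. PHARMACEUTICAL FORM\nLyophilisate and suspension."

# lookbehind style (what GetData above does): anchor the match after the heading
str_trim(str_extract(x, "(?<=PHARMACEUTICAL FORM)[^-]+"))
#> [1] "Lyophilisate and suspension."

# plain style: match heading and text together, then strip the heading off
str_remove(str_extract(x, "PHARMACEUTICAL FORM\n.+"), "PHARMACEUTICAL FORM\n")
#> [1] "Lyophilisate and suspension."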

Let's start with a single file:

file_name <- file.list[[1]]

content_by_page <- pdf_text(file_name)

One problem is that you currently get a character vector where each element is a page, but in this PDF the page boundaries are not very useful; we'd prefer to have each line as an element.

content_single_string <- str_c(content_by_page,
                               collapse = "\n")
content_by_line <- str_split_1(content_single_string,
                               pattern = "\n")

Now let's start with the regex! We will try to identify the structure of the text by finding the titles. For that, we simply use the fact that these titles start with a number and are in upper case or title case:

is_annex_title <- str_detect(content_by_line,
                             pattern = "^ *ANNEX [IVX]+ *$")
is_section_title <- str_detect(content_by_line,
                               pattern = "^[0-9]+\\. +[A-Z ]+$")
is_subsection_title <- str_detect(content_by_line,
                                  pattern = "^[0-9]+\\.[0-9]+ +[A-Z][a-z ,()]+$")

is_title <- is_annex_title | is_section_title | is_subsection_title

We can manually check our result (uncomment the filter() to make it more readable):

tibble(is_title = is_title,
       cont = content_by_line) |>
  # filter(is_title) |>
  View()

Note that we don't capture the ANNEX II section titles, which use A, B, C numbering.
Note also that, since we split by line, long titles like 6.6 get truncated across lines and are missed.
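
If you ever need those lettered titles as well, a pattern along these lines should catch them (a sketch only: I'm assuming headings of the form "A. MANUFACTURERS ..." and haven't checked it against every document):

is_letter_title <- str_detect(content_by_line,
                              pattern = "^[A-Z]\\. +[A-Z ()]+$")

is_title <- is_annex_title | is_section_title |
  is_subsection_title | is_letter_title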

So, we have our text and its structure (which will be useful later). You mention that the same title appears multiple times, so we need some criterion to know which occurrence we're interested in. Let's say we're only interested in ANNEX I:

start_annex_of_interest <- which(str_trim(content_by_line) == "ANNEX I") + 1
end_annex_of_interest <- which(str_trim(content_by_line) == "ANNEX II") - 1

content_of_interest <- content_by_line[start_annex_of_interest:end_annex_of_interest]

Now, we can look for the desired information only in the (sub)section titles:

start_section_of_interest <- which(str_detect(content_of_interest,
                 pattern = "^[0-9]+\\. +NAME OF THE VETERINARY MEDICINAL PRODUCT$")) + 1

So we know where it starts. To find out where it ends, we look for the next title:

# the first title line after the start, minus 1, is the last line of the section
end_section_of_interest <- which(is_title)[min(which(which(is_title) > start_section_of_interest))] - 1

content_of_interest[start_section_of_interest:end_section_of_interest] |>
  str_c(collapse = " ") |>
  str_trim()

And we just detected the Name!

Now you can put the code that extracts a section of interest into a function, so you can easily call it for Compositione, Target_species, etc., and then you can wrap all of that in a function to call on each document.
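
As a starting point, here is a minimal sketch of such a helper (the name extract_section and its interface are my own invention, and the edge cases are untested):

# Extract the text between a title matching title_pattern and the next title.
extract_section <- function(content, is_title, title_pattern) {
  start <- which(str_detect(content, pattern = title_pattern)) + 1
  if (length(start) == 0) return(NA_character_)  # title not found
  start <- start[[1]]                            # keep only the first occurrence
  end <- which(is_title)[min(which(which(is_title) > start))] - 1
  content[start:end] |>
    str_c(collapse = " ") |>
    str_trim()
}

extract_section(content_of_interest, is_title,
                "^[0-9]+\\. +NAME OF THE VETERINARY MEDICINAL PRODUCT$")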

Because PDFs are complex documents, you probably want to review the results manually; I probably forgot many special cases!


Thanks a lot @AlexisW! I just came back from holidays, and you made my day: I can now progress with this task!

For the purposes of this task, the "mostly good" approach is what we need; the data obtained will be verified and queried by us.

So far, to test your code, I have done a small extraction of the data I needed, and I have been able to replicate your suggested code (making some changes), which is great!

I will take advantage of your kind reply to ask for a bit more advice:

  1. You suggested starting with a single file.
    I have done so in the test, and it works well. The final goal is to be able to extract, from different PDF files, the information found under each of the words/sections of interest. Is this possible?
    I was wondering if using functions is the way forward, but functions are also something I have been avoiding for years.

Could a function like the one in my previous post (GetData <- ...) be used, swapping the lookahead/lookbehind regexes for your approach?

GetData <- function(x){
  data.frame(NAME  = str_trim(str_extract(x, "(?<=NAME OF THE VETERINARY MEDICINAL PRODUCT)[^-]+")),
       Formulation = str_trim(str_extract(x, "(?<=PHARMACEUTICAL FORM)[^-]+")),
       DOI = str_trim(str_extract(x, "(?<=Duration of immunity:)[^\\n]+")))
}
Out <- map_dfr(x, GetData)
#> Error in map_dfr(x, GetData): could not find function "map_dfr"
Out
#> Error in eval(expr, envir, enclos): object 'Out' not found

Created on 2023-07-17 with reprex v2.0.2

Underneath is the code I have tested so far; again it says it is not working, I believe because it cannot find the original PDFs.
I have created a data.frame and exported it to Excel, but it only contains the values from the first PDF. I will have many PDFs to extract information from, and doing it one by one would be an unrealistic task.
Ignore my comments in the example; they are only for my own understanding, and the wording may not be very accurate.

Thanks a lot!!

library(pdftools)
#> Warning: package 'pdftools' was built under R version 4.2.3
#> Using poppler version 22.04.0
library(stringi)
library(openxlsx)

##### Import the PDFs and prepare them in the right format:----

SPC.list <- list.files(pattern = "\\.(pdf|PDF)$") # lists the names of all PDF files in the working directory.

SPC.list <- map(SPC.list, ~ pdf_text(.)) # format all imported files with the structure of the PDF. In this example generates a LIST of two characters, one for each PDF file in this case.
#> Error in map(SPC.list, ~pdf_text(.)): could not find function "map"
str(SPC.list)
#>  chr(0)

productA <- SPC.list[[1]] # Code to take the first file [1] of the vector called "SPC.list". This is already in PDF format after applying the pdf_text(.) above.
#> Error in SPC.list[[1]]: subscript out of bounds


productA_single_string<-stri_paste(unlist(productA), collapse="\n") # Using the library 'stringi', the 'stri_paste()' function merges the vector elements into strings after converting the LIST to a vector with 'unlist()'. 
#> Error in unlist(productA): object 'productA' not found
str(productA_single_string) # Gives same result as using the code:  Leucofeligen_single_string <- str_c(Leucofeligen_by_page, collapse = "\n")
#> Error in str(productA_single_string): object 'productA_single_string' not found


productA_single_by_line <- str_split_1(productA_single_string,
                                       pattern = "\n") # Code to make each line as an element.
#> Error in str_split_1(productA_single_string, pattern = "\n"): could not find function "str_split_1"



##### Select ANNEX I only as the source of the SPC information:----

start_annex_of_interest <- which(str_trim(productA_single_by_line) == "ANNEX I") + 1
#> Error in str_trim(productA_single_by_line): could not find function "str_trim"
end_annex_of_interest <- which(str_trim(productA_single_by_line) == "ANNEX II") - 1
#> Error in str_trim(productA_single_by_line): could not find function "str_trim"
content_of_interest <- productA_single_by_line[start_annex_of_interest:end_annex_of_interest]
#> Error in eval(expr, envir, enclos): object 'productA_single_by_line' not found


##### Identify the structure of the text, to find the titles:----

is_annex_title <- str_detect(content_of_interest,
                             pattern = "^ *ANNEX [IVX]+ *$")
#> Error in str_detect(content_of_interest, pattern = "^ *ANNEX [IVX]+ *$"): could not find function "str_detect"
is_section_title <- str_detect(content_of_interest,
                               pattern = "^[0-9]+\\. +[A-Z ]+$")
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\. +[A-Z ]+$"): could not find function "str_detect"
is_subsection_title <- str_detect(content_of_interest,
                                  pattern = "^[0-9]+\\.[0-9]+ +[A-Z][a-z ,()]+$")
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\.[0-9]+ +[A-Z][a-z ,()]+$"): could not find function "str_detect"

is_title <- is_annex_title| is_section_title | is_subsection_title
#> Error in eval(expr, envir, enclos): object 'is_annex_title' not found

tibble(is_title = is_title,
       cont = content_of_interest) |>
  # filter(is_title) |>
  View()
#> Error in tibble(is_title = is_title, cont = content_of_interest): could not find function "tibble"

##### Select each of the section of interest (SOI) to create a table:----

### Name of product: ---
start_SOI_name <- which(str_detect(content_of_interest,
                                   pattern = "^[0-9]+\\. +NAME OF THE VETERINARY MEDICINAL PRODUCT$")) + 1
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\. +NAME OF THE VETERINARY MEDICINAL PRODUCT$"): could not find function "str_detect"

end_SOI_name <- which(is_title)[min(which(which(is_title) > start_SOI_name))] - 1
#> Error in which(is_title): object 'is_title' not found

Name_product <- content_of_interest[start_SOI_name:end_SOI_name] |>
  str_c(collapse = " ") |>
  str_trim()
#> Error in str_trim(str_c(content_of_interest[start_SOI_name:end_SOI_name], : could not find function "str_trim"

### QUALITATIVE AND QUANTITATIVE COMPOSITION: ---
start_SOI_composition <- which(str_detect(content_of_interest,
                                          pattern = "^[0-9]+\\. +QUALITATIVE AND QUANTITATIVE COMPOSITION$")) + 1
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\. +QUALITATIVE AND QUANTITATIVE COMPOSITION$"): could not find function "str_detect"

end_SOI_composition <- which(is_title)[min(which(which(is_title) > start_SOI_composition))] - 1
#> Error in which(is_title): object 'is_title' not found

Composition_product <- content_of_interest[start_SOI_composition:end_SOI_composition] |>
  str_c(collapse = " ") |>
  str_trim()
#> Error in str_trim(str_c(content_of_interest[start_SOI_composition:end_SOI_composition], : could not find function "str_trim"

str(Composition_product)
#> Error in str(Composition_product): object 'Composition_product' not found

### Target species: ---
start_SOI_target <- which(str_detect(content_of_interest,
                                     pattern = "^[0-9]+\\.[0-9]+ +Target species+$")) + 1
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\.[0-9]+ +Target species+$"): could not find function "str_detect"

end_SOI_target <- which(is_title)[min(which(which(is_title) > start_SOI_target))] - 1
#> Error in which(is_title): object 'is_title' not found

Target_spp_product <- content_of_interest[start_SOI_target:end_SOI_target] |>
  str_c(collapse = " ") |>
  str_trim()
#> Error in str_trim(str_c(content_of_interest[start_SOI_target:end_SOI_target], : could not find function "str_trim"

### Indications for use: ---
start_SOI_indications <- which(str_detect(content_of_interest,
                                          pattern = "^[0-9]+\\.[0-9]+ +Indications for use+")) + 1
#> Error in str_detect(content_of_interest, pattern = "^[0-9]+\\.[0-9]+ +Indications for use+"): could not find function "str_detect"

end_SOI_indications <- which(is_title)[min(which(which(is_title) > start_SOI_indications))] - 1
#> Error in which(is_title): object 'is_title' not found

indications_product <- content_of_interest[start_SOI_indications:end_SOI_indications] |>
  str_c(collapse = " ") |>
  str_trim()
#> Error in str_trim(str_c(content_of_interest[start_SOI_indications:end_SOI_indications], : could not find function "str_trim"


### OOI: ---
start_SOI_OOI <- which(str_detect(content_of_interest,
                                          pattern = "onset of immunity+"))
#> Error in str_detect(content_of_interest, pattern = "onset of immunity+"): could not find function "str_detect"

end_SOI_OOI <- which(is_title)[min(which(which(is_title) > start_SOI_OOI))] - 1
#> Error in which(is_title): object 'is_title' not found
OOI_product <- content_of_interest[start_SOI_OOI:end_SOI_OOI] |>
  str_c(collapse = " ") |>
  str_trim()
#> Error in str_trim(str_c(content_of_interest[start_SOI_OOI:end_SOI_OOI], : could not find function "str_trim"

GetData <- data.frame(NAME  = Name_product,
                Formulation = Composition_product,
                Target_species = Target_spp_product,
                Indications_use = indications_product,
                OOI = OOI_product)
#> Error in data.frame(NAME = Name_product, Formulation = Composition_product, : object 'Name_product' not found

GetData
#> Error in eval(expr, envir, enclos): object 'GetData' not found
write.xlsx(GetData,'test_file.xlsx', sheetName="Sheet1")
#> Error in buildWorkbook(x, asTable = asTable, ...): object 'GetData' not found

Created on 2023-07-17 with reprex v2.0.2

Yes, that was my idea. I know writing your own function can be intimidating at first, but once you get used to it, it's not that hard.

The good news is that functions are not strictly necessary: anything that can be done with a function can also be done with a for loop, if that's more intuitive to you.

Yes, except you would use the code that works for a single file as the body of the function or of the for loop (see the sketch just below).
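
For the overall shape, here is a sketch of both versions; get_one_file is a made-up stand-in whose real body would be your single-file extraction code, returning a one-row data frame:

library(tidyverse)
library(pdftools)

# stand-in per-file function: here it only counts pages
get_one_file <- function(file_name) {
  tibble(filename = file_name,
         n_pages  = length(pdf_text(file_name)))
}

# function version: one row per file
results <- map_dfr(file.list, get_one_file)

# equivalent for-loop version
results <- list()
for (file_name in file.list) {
  results[[file_name]] <- get_one_file(file_name)
}
results <- bind_rows(results)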

There is a solution to that; it's a bit subtle, but you have to specify that you want to download the PDF as a binary:

download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90",
  "Pentofelb.pdf",
  mode = "wb")

Anyway, here is something that should do what you need. I added Compositione and will let you do the rest (note that I also corrected the URL for Pentofel, which was the same as the Leucofeligen one in your example):

library(tidyverse)
library(pdftools)
#> Using poppler version 22.04.0

download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/4320061e-8ef0-40c9-baec-b79fbca64569",
              "Pentofel.pdf", mode = "wb")
download.file("https://medicines.health.europa.eu/veterinary/en/documents/download/e7da093e-0dc2-4120-8fa1-4beb29cc2a90",
              "Leucofeligen.pdf", mode = "wb")

file.list <- list.files(pattern = "\\.(pdf|PDF)$")


# initialize the results dataframe
results <- tibble(filename = file.list,
                  Name = "",
                  Compositione = "")


for(row in 1:nrow(results)){
  file_name <- results$filename[row]
  
  content_by_page <- pdf_text(file_name)
  content_single_string <- str_c(content_by_page,
                                 collapse = "\n")
  content_by_line <- str_split_1(content_single_string,
                                 pattern = "\n")
  
  
  # note we don't capture the ANNEX II section titles which use A, B, C numbering
  # note also since we split by line, long titles like 6.6 are truncated
  
  # let's say we're only interested in ANNEX I
  start_annex_of_interest <- which(str_trim(content_by_line) == "ANNEX I") + 1
  end_annex_of_interest <- which(str_trim(content_by_line) == "ANNEX II") - 1
  
  content_of_interest <- content_by_line[start_annex_of_interest:end_annex_of_interest]
  
  
  is_annex_title <- str_detect(content_of_interest,
                               pattern = "^ *ANNEX [IVX]+ *$")
  is_section_title <- str_detect(content_of_interest,
                                 pattern = "^[0-9]+\\. +[A-Z ]+$")
  is_subsection_title <- str_detect(content_of_interest,
                                    pattern = "^[0-9]+\\.[0-9]+ +[A-Z][a-z ,()]+$")
  
  is_title <- is_annex_title | is_section_title | is_subsection_title
  
  
  # tibble(is_title = is_title,
  #        cont = content_of_interest) |>
  #   filter(is_title) |>
  #   View()
  
  
  
  # now, we can look for the desired information only in (sub)section titles
  
  # NAME ----
  start_section_of_interest <- which(str_detect(content_of_interest,
                                                pattern = "^[0-9]+\\. +NAME OF THE VETERINARY MEDICINAL PRODUCT$")) + 1
  
  
  end_section_of_interest <- which(is_title)[min(which(which(is_title) > start_section_of_interest))] - 1
  
  results$Name[row] <- content_of_interest[start_section_of_interest:end_section_of_interest] |>
    str_c(collapse = " ") |>
    str_trim()
  
  
  # COMPOSITIONE ----
  start_section_of_interest <- which(str_detect(content_of_interest,
                                                pattern = "^[0-9]+\\. +QUALITATIVE AND QUANTITATIVE COMPOSITION *$")) + 1
  
  
  end_section_of_interest <- which(is_title)[min(which(which(is_title) > start_section_of_interest))] - 1
  
  results$Compositione[row] <- content_of_interest[start_section_of_interest:end_section_of_interest] |>
    str_c(collapse = " ") |>
    str_trim()
}


results
#> # A tibble: 2 × 3
#>   filename         Name                                             Compositione
#>   <chr>            <chr>                                            <chr>       
#> 1 Leucofeligen.pdf LEUCOFELIGEN FeLV/RCP lyophilisate and suspensi… Per dose of…
#> 2 Pentofel.pdf     Fevaxyn Pentofel, suspension for injection for … Per dose of…

Created on 2023-07-18 with reprex v2.0.2

I had also made a mistake earlier by finding the titles in content_by_line: once we work on content_of_interest, those line numbers no longer match.


Hi AlexisW,

Many thanks again. Your code has worked perfectly for me!

All I need now is to learn a bit more about regular expressions, to fine-tune this for the real documents :slight_smile:

Beatriz

