Extracting Text from PDFs

clarktar · June 24, 2025, 8:49pm

I have some code set up to work through many files (individual PDFs) in a folder. They are all simple email exports saved as PDFs. I am having a hard time figuring out how to actually extract some of the text. I was able to use stringr to detect strings, but I can't figure out how to extract what I want and save it to a data frame.

I would like to extract the name after the word "From:", the entire date string after the word "Date" and test if there are attachments (Y/N). And lastly, I would like the entire text from the body of the email to be captured in the column "text" in my dataframe. I have tried a few different ways for extracting the name and date which you will see under the comment "Extract who sent email and the date".

Ideally, this example email would result in one row of a data frame, such as:

[image: example email and the desired data frame row]

library(pdftools)
library(purrr)
library(dplyr)
library(stringr)
# Set the working directory to where the PDF files are located
pdf_path <- "/Desktop/test"

# List all PDF files in the directory
# `pattern` is a regular expression, not a glob, so anchor on the ".pdf" extension
file_list <- list.files(path = pdf_path, pattern = "\\.pdf$", full.names = TRUE)

# Define the regular expressions for extracting date and name
rx_date <- "Date:.*\\K\\b[0-9]{1,2}\\s[a-zA-Z]+\\s[0-9]{4}\\b"
rx_name <- "From:.*\\K\\b[_a-z0-9-]+(?:\\.[_a-z0-9-]+)*@[a-z0-9-]+(?:\\.[a-z0-9-]+)*\\.[a-z]{2,4}\\b"

# Function to extract text from PDF and save to Excel
extract_pdf_to_excel <- function(file_list) {
  # Read the PDF file
  pdf_text <- pdf_text(file_list)
  
  # Split the text into lines and full text
  lines <- unlist(strsplit(pdf_text, "\n"))
  full_text <- paste(pdf_text, collapse = " ")
  
  # Extract who sent the email and the date
  name <- str_detect(full_text,
                     pattern = "From:.*\n.*")
  date <- str_detect(full_text,
                     pattern = "Date:.*\\K\\b[0-9]{1,2}\\s[a-zA-Z]+\\s[0-9]{4}\\b")
  
  # Extract the text from the PDF into one text string, make lower case for case insensitive search of terms
   #full_text <- paste(pdf_text, collapse = " ")
   full_textL <- tolower(full_text)
   
   # Check for presence of attachments
  # grepl(pattern, x): search the lower-cased text for the word "attachment"
  has_attachments <- grepl("attachment", full_textL, fixed = TRUE)
  has_attachments <- ifelse(has_attachments, "Yes", "No")
  # Create a data frame from the lines
  resultdf <- data.frame(filename = basename(file_list), name = name, date = date, attachment = has_attachments, text = full_text, stringsAsFactors = FALSE)
  return(resultdf)
}

result_list <- map(file_list, extract_pdf_to_excel)
# Combine all processed data into a single df
results_df <- bind_rows(result_list, .id = "file_id")
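
One thing worth noting about the code above: `str_detect()` only reports whether a pattern matched (TRUE/FALSE), so `name` and `date` end up as logicals rather than text. A minimal sketch of pulling out the matched text with `str_extract()` and a lookbehind follows. The sample header text is made up for illustration, and it assumes each field sits on its own line with a single space after the colon, which may not hold for the real exports.

```r
library(stringr)

# Made-up example text; real pdf_text() output will be laid out differently
sample_text <- "From: Jane Doe\nTo: John Roe\nDate: Tuesday, June 24, 2025 8:49 PM\n\nHello"

# str_detect() only answers "did it match?"
found_from <- str_detect(sample_text, "From:")

# str_extract() returns the matched text itself. The lookbehind (?<=...) keeps
# the label out of the result, and `.` stops at the end of the line by default.
name <- str_extract(sample_text, "(?<=From: ).*")
date <- str_extract(sample_text, "(?<=Date: ).*")

# grepl() takes the pattern first and the text second
has_attachments <- grepl("attachment", tolower(sample_text), fixed = TRUE)
```

Both `name` and `date` are length-one character vectors here, so they can go straight into a one-row data frame.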

clarktar · June 25, 2025, 3:58pm

I have examined the files after converting them with pdf_text. All files have a similar structure: the first line is the "From" line, and the body text starts somewhere after line 6. Line 5 is blank unless there are attachments, in which case the word "Attachments" is present in line 5.

Can anyone help me come up with a solution that will extract the name in line 1, date in line 4, Yes/No if "attachments" are present in line 5, and then extract all text after line 6?

Thanks!!
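
If the layout really is that regular, one option is to index the lines by position instead of searching the whole text. Below is a rough sketch under that assumption. `parse_email_lines()` is a hypothetical helper name, the sample lines are invented, and it uses only base R so it can be tried without any PDFs.

```r
# Hypothetical helper: `lines` is the character vector obtained by splitting
# the pdf_text() output on "\n". Assumes name on line 1, date on line 4,
# an optional "Attachments" marker on line 5, and body text after line 6.
parse_email_lines <- function(lines) {
  name <- sub("^\\s*From:\\s*", "", lines[1])
  date <- sub("^\\s*Date:\\s*", "", lines[4])
  has_attachments <- grepl("attachments", tolower(lines[5]), fixed = TRUE)

  # Drop the six header lines and glue the rest into a single string,
  # so the body occupies exactly one cell of the data frame
  body <- paste(lines[-(1:6)], collapse = "\n")

  data.frame(
    name = name,
    date = date,
    attachment = ifelse(has_attachments, "Yes", "No"),
    text = body,
    stringsAsFactors = FALSE
  )
}

# Invented example input
example_lines <- c(
  "From: Jane Doe", "To: John Roe", "Subject: Test", "Date: 24 June 2025",
  "Attachments: report.pdf", "", "Hello John,", "See attached."
)
row <- parse_email_lines(example_lines)
```

Using `lines[-(1:6)]` rather than `lines[7:length(lines)]` avoids a surprise when a file has six lines or fewer, since `7:length(lines)` would then count backwards.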

clarktar · June 25, 2025, 6:25pm

Well, I have banged my head against this and resolved many of my issues. Here is the current code that works:

# List all PDF files in the directory
# `pattern` is a regular expression, not a glob, so anchor on the ".pdf" extension
file_list <- list.files(path = pdf_path, pattern = "\\.pdf$", full.names = TRUE)

# Function to extract text from PDF and save to Excel
extract_pdf_to_excel <- function(file_list) {
  # Read the PDF file
  pdf_text <- pdf_text(file_list)
  
  # Split the text into lines and full text
  lines <- unlist(strsplit(pdf_text, "\n"))
  full_text <- paste(pdf_text, collapse = " ")
  
  # Extract who sent the email and the date
  name <- str_extract(lines, "(?<=From:\\s).*") 
  date <- str_extract(lines, "(?<=Date:\\s).*")
  
  # Extract the text from the PDF into one text string, make lower case for case insensitive search of terms
   full_text <- paste(pdf_text, collapse = " ")
   full_textL <- tolower(full_text)
   
   # Check for presence of attachments
  has_attachments <- str_detect(lines[[5]], "Attachments")
  yes_no_result <- ifelse(has_attachments, "Yes", "No")
  
  # Extract body of email
  text_after_line_5 <- lines[6:length(lines)] # This is not working

  # Create a data frame from the lines
  resultdf <- data.frame(filename = basename(file_list), name = name, date = date, attachment = yes_no_result, 
                         text = NA, stringsAsFactors = FALSE)
  return(resultdf)
}

result_list <- map(file_list, extract_pdf_to_excel)

# Combine all processed data into a single df
results_df <- bind_rows(result_list, .id = "file_id")

# Clean up the results data frame by grouping.
results_df <- results_df |> 
  group_by(file_id) |> 
  summarise(
    filename = first(filename),
    name = first(name),
    date = first(na.omit(date)),
    attachment = first(attachment),
    text = paste(text, collapse = " "),
    .groups = "drop"
  )

The only remaining issue is where I try to extract the body of the email. The body seems to start after line 5 in each of my PDFs, but my code does not work. I get this error:

Error in `map()`:
ℹ In index: 1.
Caused by error in `data.frame()`:
! arguments imply differing number of rows: 1, 60, 55
Run `rlang::last_trace()` to see where the error occurred.

I am pretty sure this has to do with the fact that each email (PDF in this case) has a different number of lines (rows when I split by line). In the code above, under the comment "Create a data frame from the lines", I have set text = NA as a workaround. What I actually want is this:

# Create a data frame from the lines
  resultdf <- data.frame(filename = basename(file_list), name = name, date = date, attachment = yes_no_result, 
                         text = text_after_line_5, stringsAsFactors = FALSE)

Setting text = NA allows the code to run without error, but it obviously populates the "text" field with NAs rather than the contents of "text_after_line_5".

Not sure how to correct this issue and any help would be appreciated!
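
Looking at the numbers in the error, the lengths 1, 60, 55 could all come from a single file: `filename` has length 1, `name` and `date` have one element per line (60, mostly NA), and `text_after_line_5` has 55 elements. `data.frame()` can recycle a length-1 value to 60 rows, but it cannot recycle 55. A possible fix, sketched below, is to reduce every column to a single value before building the row. `email_to_row()` is a hypothetical helper name and the sample lines are made up.

```r
library(stringr)

# Hypothetical helper: builds exactly one row per email
email_to_row <- function(lines, filename) {
  # str_extract() returns one value per line (NA where nothing matched),
  # so keep only the first non-missing match
  name <- str_extract(lines, "(?<=From:\\s).*")
  name <- name[!is.na(name)][1]
  date <- str_extract(lines, "(?<=Date:\\s).*")
  date <- date[!is.na(date)][1]

  # isTRUE() guards against NA when a file has fewer than 5 lines
  has_attachments <- isTRUE(str_detect(lines[5], "Attachments"))

  # Collapse the body lines into ONE string so `text` has length 1
  body <- paste(lines[-(1:5)], collapse = " ")

  data.frame(
    filename = filename,
    name = name,
    date = date,
    attachment = ifelse(has_attachments, "Yes", "No"),
    text = body,
    stringsAsFactors = FALSE
  )
}

# Invented example input
example_lines <- c(
  "From: Jane Doe", "To: John Roe", "Subject: Test", "Date: 24 June 2025",
  "", "Hello John,", "Second line."
)
one_row <- email_to_row(example_lines, "example.pdf")
```

Since every column now has length 1, each file yields a single row, and the `group_by()` / `summarise()` cleanup step should no longer be needed. Inside the `map()` call, the helper could be fed with something like `email_to_row(unlist(strsplit(pdf_text(f), "\n")), basename(f))` for each file `f`.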