Hi @Shin1
Your problem is much trickier than it seems at first. Normally we would import the text with the pdftools package and wrangle it from there, but as you may have noticed, the text cannot be imported properly and comes back mostly as garbled characters.
Whether accidental or intentional, the ToUnicode table of the PDF is broken, so you can see the glyphs on screen but they don’t map to actual Unicode characters. So even if you copy the text into any editor, the problem remains.
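If you want to reproduce the problem, a minimal check with pdftools looks something like this (pdftools should be able to read straight from the URL; otherwise download the file first and point to the local path):

```r
library(pdftools)

#try the normal text extraction; this is what we would usually do
raw_text <- pdf_text("https://www.ilo.org/public/english/bureau/stat/isco/docs/publication08.pdf")

#instead of readable text, the pages come back mostly as garbled characters
cat(substr(raw_text[100], 1, 500))
```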
The only solution I see is to OCR the document.
The second problem is the page layout: some lines of text span the full width of the page while others are spread across two columns.
But after noticing the following patterns, I was able to write some code that does the job:
- Useful content is from page 100 to page 371
- Content consists of Major Groups, Sub-Major Groups, Minor Groups and Unit Groups. The two-column layout is used only for Unit Groups, which are our content of interest, so we’ll need to OCR half pages.
- Each Unit Group is organized this way:
  - Unit Group #: always on one line
  - Job Title: on one or two lines
  - Description: on several lines
- The text “Tasks include” marks the end of the description text; it is followed by other content that we can filter out (the small example after this list illustrates the detection)
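To make the detection concrete, here is a tiny, self-contained illustration with made-up lines (a fictional Unit Group, not taken from the document) showing the two checks the code below relies on:

```r
library(stringr)

#made-up OCR lines for a fictional Unit Group, just to illustrate the structure described above
sample_lines <- c("Unit Group 1234",
                  "Example Occupations",
                  "Example occupations plan, direct and coordinate example activities",
                  "in example organisations.",
                  "Tasks include - planning examples; directing examples ...")

str_extract(sample_lines, "^Unit Group [0-9]{3,}$")  #matches only the header line
str_detect(sample_lines, "Tasks include")            #flags the boundary line we can filter from
```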
Based on that, I wrote some code that takes a fairly long time to run, possibly up to one hour: OCR can be slow when it has to deal with many pages.
I commented the code (although I was a bit sparing with the comments), and I hope it helps you follow what is going on.
Note that OCR is not perfectly accurate and some characters may be wrongly detected (i can become l, ...); see the cleanup suggestion after the code.
```r
#load packages; note that plyr needs to be loaded before the tidyverse to avoid conflicts with dplyr
packages <- c("plyr", "tidyverse", "pdftools", "tesseract", "magick")
lapply(packages, require, character.only = TRUE)
#create directory to store images (each page will be saved as an image)
dir.create("pngs")
#change directory to pngs
setwd("pngs")
#convert the pdf pages to png format; it took me about 15 minutes
pngfile <- pdf_convert("https://www.ilo.org/public/english/bureau/stat/isco/docs/publication08.pdf",
                       pages = 100:371,
                       dpi = 600)
#reset your wd to the upper level
setwd("..")
#function to OCR the content; this is done not on the full page but on each half page, since we have two columns
#the function will store CSV files, one per page of text; for this you need to create a new directory:
dir.create("csvs")
run_ocr <- function(page_number){
  #I took the code below from here: https://www.r-bloggers.com/2019/03/get-text-from-pdfs-or-images-using-ocr-a-tutorial-with-tesseract-and-magick/
  #pick the PNG corresponding to this page (page 100 is the first file in the folder)
  img_name <- paste0(getwd(), "/pngs/", list.files("pngs")[page_number - 99])
  images <- map(img_name, image_read)
  #crop the page into a left half and a right half (each 2550 px wide at 600 dpi)
  first_half <- map(images, ~image_crop(., geometry = "2550x6500"))
  second_half <- map(images, ~image_crop(., geometry = "2550x6500+2550+0"))
  #combine the two halves into one list: left half first, then right half
  merged_list <- prepend(first_half, NA) %>%
    reduce2(second_half, c) %>%
    discard(is.na)
  #OCR each half and split the result into individual lines
  text_list <- map(merged_list, ocr)
  text_list <- text_list %>%
    map(., ~str_split(., "\n"))
  #unlist, drop empty lines and stack the left column content on top of the right column content
  pdf_lhs <- text_list[[1]] %>% unlist() %>% as_tibble() %>%
    mutate(value = ifelse(value == "", NA, value)) %>% drop_na()
  pdf_rhs <- text_list[[2]] %>% unlist() %>% as_tibble() %>%
    mutate(value = ifelse(value == "", NA, value)) %>% drop_na()
  pdf <- bind_rows(pdf_lhs, pdf_rhs)
  write_csv(pdf, paste0("csvs/pdf", page_number, ".csv"))
}
#run the function on all pages (100 to 371); it took me about 30 minutes
map(100:371, run_ocr)
#read all the CSVs at once and combine them into a single tibble ("pdf")
csvs <- "csvs"
csvfiles <- list.files(path = csvs, pattern = "\\.csv$", full.names = TRUE)
pdf <- ldply(csvfiles, read_csv)
pdf <- as_tibble(pdf)
#process the tibble to keep only the content you need
#you can run the code in chunks to see what is done at each stage
pdf_formatted <- pdf %>%
  mutate(unit_group = str_extract(value, "^Unit Group [0-9]{3,}$"),
         line_type = ifelse(str_detect(value, "Tasks include"), "Tasks include", NA)) %>%
  fill(unit_group) %>%
  group_by(unit_group) %>%
  mutate(linecount = row_number(),
         line_type = case_when(
           linecount == 1 & str_detect(value, "Unit Group") ~ "Unit Group",
           linecount == 2 ~ "Job Title",
           linecount == 3 & nchar(value) < 35 ~ "Job Title",
           linecount == 3 & nchar(value) >= 35 ~ "Description",
           linecount == 4 & str_detect(value, "Tasks include", negate = TRUE) ~ "Description",
           line_type == "Tasks include" & lag(line_type) == "Tasks include" ~ NA_character_,
           TRUE ~ line_type
         )) %>%
  fill(line_type) %>%
  ungroup() %>%
  filter(line_type %in% c("Unit Group", "Job Title", "Description")) %>%
  mutate(value = str_squish(value) %>% str_c(" "),
         value = str_remove(value, "(-|- )$"))
#summarise the pdf tibble to have one row per unit group, with unit group number, job title and description
pdf_table <- pdf_formatted %>%
  group_by(unit_group) %>%
  summarise(unit_group = value[line_type == "Unit Group"],
            job_title = paste(value[line_type == "Job Title"], collapse = ""),
            description = paste(value[line_type == "Description"], collapse = "")) %>%
  ungroup() %>%
  #remove trailing white spaces
  mutate(across(everything(), ~ str_trim(.))) %>%
  #if the description field has more than 1000 characters, keep only the first two sentences
  mutate(description = ifelse(nchar(description) > 1000,
                              str_extract(description, '(.*?[a-z0-9][.](?= )){1,2}'),
                              description))
#save the result to a CSV file
write_csv(pdf_table, "pdf_isco08.csv")
```
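Coming back to the note on OCR accuracy above: if you spot recurring mis-read characters in the final table, you can add a small post-processing pass at the end. This is only a sketch, and the replacements below are made-up examples rather than errors I actually checked for in this document:

```r
#continuing from the code above (tidyverse already loaded)
#hypothetical fixes for recurring OCR confusions; adapt the named vector to what you actually observe
ocr_fixes <- c("lndustry" = "Industry", "rnanager" = "manager")

pdf_table_clean <- pdf_table %>%
  mutate(across(c(job_title, description), ~ str_replace_all(., ocr_fixes)))

write_csv(pdf_table_clean, "pdf_isco08_clean.csv")
```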
If anyone reading this has ideas for improvement (cleaner code, a more efficient and quicker workflow, ...), they are very welcome.
PS: I took some of the code from this article: [Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}](https://www.r-bloggers.com/2019/03/get-text-from-pdfs-or-images-using-ocr-a-tutorial-with-tesseract-and-magick/)