Hi @Shin1
Your problem is much trickier than it seems at first. Normally we would import the text with the pdftools package and wrangle it from there, but as you may have noticed, the text cannot be imported properly and comes back mostly as garbled characters.
Whether accidental or intentional, the ToUnicode table of the PDF is broken, so you can see the glyphs on screen but they don’t map to actual Unicode characters. So even if you copy the text into any editor, the problem remains.
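If you want to reproduce the problem, a minimal check with pdftools looks something like this (pdftools should be able to read straight from the URL; otherwise download the file first and point to the local path):

```r
library(pdftools)

#try the normal text extraction; this is what we would usually do
raw_text <- pdf_text("https://www.ilo.org/public/english/bureau/stat/isco/docs/publication08.pdf")

#instead of readable text, the pages come back mostly as garbled characters
cat(substr(raw_text[100], 1, 500))
```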
The only solution I see is to OCR the document.
The second problem is the page layout: some lines of text span the full width of the page while others are spread across two columns.
But after noticing the following patterns, I was able to write some code that does the job:
- Useful content is from page 100 to page 371
- Content consists of Major Groups, Sub-Major Groups, Minor Groups and Unit Groups. The two-column layout is used only for Unit Groups, which are our content of interest, so we’ll need to OCR half pages.
- Each Unit Group is organized this way:
  - Unit Group #: always on one line
  - Job Title: on one or two lines
  - Description: on several lines
- The text “Tasks include” marks the end of the description text; it is followed by other content that we can filter out (the small example after this list illustrates the detection)
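To make the detection concrete, here is a tiny, self-contained illustration with made-up lines (a fictional Unit Group, not taken from the document) showing the two checks the code below relies on:

```r
library(stringr)

#made-up OCR lines for a fictional Unit Group, just to illustrate the structure described above
sample_lines <- c("Unit Group 1234",
                  "Example Occupations",
                  "Example occupations plan, direct and coordinate example activities",
                  "in example organisations.",
                  "Tasks include - planning examples; directing examples ...")

str_extract(sample_lines, "^Unit Group [0-9]{3,}$")  #matches only the header line
str_detect(sample_lines, "Tasks include")            #flags the boundary line we can filter from
```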
Based on that, I wrote some code that takes a fairly long time to run, possibly up to one hour: OCR can be slow when it has to deal with many pages.
I commented the code (although I was a bit sparing with the comments), and I hope it helps you follow what is going on.
Note that OCR is not perfectly accurate and some characters may be wrongly detected (i can become l, ...); see the cleanup suggestion after the code.
```r
#load packages; note that plyr needs to be loaded before the tidyverse to avoid conflicts with dplyr
packages <- c("plyr", "tidyverse", "pdftools", "tesseract", "magick")
lapply(packages, require, character.only = TRUE)
#create directory to store images (each page will be saved as an image)
dir.create("pngs")
#change directory to pngs
setwd("pngs")
#convert the pdf pages to png format; it took me about 15 minutes
pngfile <- pdf_convert("https://www.ilo.org/public/english/bureau/stat/isco/docs/publication08.pdf",
                       pages = 100:371,
                       dpi = 600)
#reset your wd to the upper level
setwd("..")
#function to OCR the content; this is done not on the full page but on each half page, since we have two columns
#the function will store CSV files, one per page of text; for this you need to create a new directory:
dir.create("csvs")
run_ocr <- function(page_number){
  #I took the code below from here: https://www.r-bloggers.com/2019/03/get-text-from-pdfs-or-images-using-ocr-a-tutorial-with-tesseract-and-magick/
  #pick the PNG corresponding to this page (page 100 is the first file in the folder)
  img_name <- paste0(getwd(), "/pngs/", list.files("pngs")[page_number - 99])
  images <- map(img_name, image_read)
  #crop the page into a left half and a right half (each 2550 px wide at 600 dpi)
  first_half <- map(images, ~image_crop(., geometry = "2550x6500"))
  second_half <- map(images, ~image_crop(., geometry = "2550x6500+2550+0"))
  #combine the two halves into one list: left half first, then right half
  merged_list <- prepend(first_half, NA) %>%
    reduce2(second_half, c) %>%
    discard(is.na)
  #OCR each half and split the result into individual lines
  text_list <- map(merged_list, ocr)
  text_list <- text_list %>%
    map(., ~str_split(., "\n"))
  #unlist, drop empty lines and stack the left column content on top of the right column content
  pdf_lhs <- text_list[[1]] %>% unlist() %>% as_tibble() %>%
    mutate(value = ifelse(value == "", NA, value)) %>% drop_na()
  pdf_rhs <- text_list[[2]] %>% unlist() %>% as_tibble() %>%
    mutate(value = ifelse(value == "", NA, value)) %>% drop_na()
  pdf <- bind_rows(pdf_lhs, pdf_rhs)
  write_csv(pdf, paste0("csvs/pdf", page_number, ".csv"))
}
#run the function on all pages (100 to 371); it took me about 30 minutes
map(100:371, run_ocr)
#read all the CSVs at once and combine them into a single tibble ("pdf")
csvs <- "csvs"
csvfiles <- list.files(path = csvs, pattern = "\\.csv$", full.names = TRUE)
pdf <- ldply(csvfiles, read_csv)
pdf <- as_tibble(pdf)
#process the tibble to keep only the content you need
#you can run the code in chunks to see what is done at each stage
pdf_formatted <- pdf %>%
  mutate(unit_group = str_extract(value, "^Unit Group [0-9]{3,}$"),
         line_type = ifelse(str_detect(value, "Tasks include"), "Tasks include", NA)) %>%
  fill(unit_group) %>%
  group_by(unit_group) %>%
  mutate(linecount = row_number(),
         line_type = case_when(
           linecount == 1 & str_detect(value, "Unit Group") ~ "Unit Group",
           linecount == 2 ~ "Job Title",
           linecount == 3 & nchar(value) < 35 ~ "Job Title",
           linecount == 3 & nchar(value) >= 35 ~ "Description",
           linecount == 4 & str_detect(value, "Tasks include", negate = TRUE) ~ "Description",
           line_type == "Tasks include" & lag(line_type) == "Tasks include" ~ NA_character_,
           TRUE ~ line_type
         )) %>%
  fill(line_type) %>%
  ungroup() %>%
  filter(line_type %in% c("Unit Group", "Job Title", "Description")) %>%
  mutate(value = str_squish(value) %>% str_c(" "),
         value = str_remove(value, "(-|- )$"))
#summarise the pdf tibble to have one row per unit group, with unit group number, job title and description
pdf_table <- pdf_formatted %>%
  group_by(unit_group) %>%
  summarise(unit_group = value[line_type == "Unit Group"],
            job_title = paste(value[line_type == "Job Title"], collapse = ""),
            description = paste(value[line_type == "Description"], collapse = "")) %>%
  ungroup() %>%
  #remove trailing white spaces
  mutate(across(everything(), ~ str_trim(.))) %>%
  #if the description field has more than 1000 characters, keep only the first two sentences
  mutate(description = ifelse(nchar(description) > 1000,
                              str_extract(description, '(.*?[a-z0-9][.](?= )){1,2}'),
                              description))
#save the result to a CSV file
write_csv(pdf_table, "pdf_isco08.csv")
```
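Coming back to the note on OCR accuracy above: if you spot recurring mis-read characters in the final table, you can add a small post-processing pass at the end. This is only a sketch, and the replacements below are made-up examples rather than errors I actually checked for in this document:

```r
#continuing from the code above (tidyverse already loaded)
#hypothetical fixes for recurring OCR confusions; adapt the named vector to what you actually observe
ocr_fixes <- c("lndustry" = "Industry", "rnanager" = "manager")

pdf_table_clean <- pdf_table %>%
  mutate(across(c(job_title, description), ~ str_replace_all(., ocr_fixes)))

write_csv(pdf_table_clean, "pdf_isco08_clean.csv")
```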
If anyone reading this has ideas for improvement (cleaner code, a more efficient and quicker workflow, ...), they are very welcome.
PS: I took some of the code from this article: [Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}](https://www.r-bloggers.com/2019/03/get-text-from-pdfs-or-images-using-ocr-a-tutorial-with-tesseract-and-magick/)