Hi all,
I'm attempting to read a table into R using the packages pdftools and tabulizer. On the whole, I've been successful in this, other than with the first page.
The first page of the PDF is split into two halves: the top is a text box and the bottom is the beginning of the table I want to read in. When I use the function extract_tables, only the text box at the top is extracted and the table below it is ignored. The rest of the table on subsequent pages is read in successfully.
The code I've been using:
#Load plyr before tidyverse so that plyr does not mask dplyr functions
library(plyr)
library(tidyverse)
library(here)
library(pdftools)
library(tabulizer)
#Read in file location
pdf_file <- here::here("pdf_location.pdf")
#Examine the recognised text - here the whole of the first page is read in, both the text box and the table
text <- pdf_text(pdf_file)
#Extract table
tables <- tabulizer::extract_tables(pdf_file, pages = 1)
#Run to examine results. Here the bottom half of the first page is missing.
tables
I've attempted using the arguments guess = FALSE and area = ..., but I'm failing to get the table from page 1 with either. Has anyone solved a similar issue? Thank you!
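For reference, this is roughly the shape of call I mean when combining guess and area. It is only a minimal sketch: the assumption that the table occupies the bottom half of page 1 is for illustration and is not the measured position in my PDF, and pdf_location.pdf is the same placeholder path as above.

```r
library(tabulizer)

# Placeholder path, same as in the code above
pdf_file <- here::here("pdf_location.pdf")

# Page size in points; get_page_dims() returns one c(width, height) vector per page
dims <- tabulizer::get_page_dims(pdf_file, pages = 1)[[1]]
page_width  <- dims[1]
page_height <- dims[2]

# ASSUMPTION (illustrative only): the table starts about halfway down page 1.
# area is a list with one c(top, left, bottom, right) vector per page, in points,
# measured from the top-left corner of the page.
table_area <- list(c(page_height / 2, 0, page_height, page_width))

# guess = FALSE makes extract_tables use the supplied area instead of auto-detection
tables_p1 <- tabulizer::extract_tables(
  pdf_file,
  pages = 1,
  area  = table_area,
  guess = FALSE
)

tables_p1
```

The coordinates could also be picked interactively with tabulizer::locate_areas(pdf_file, pages = 1) rather than estimated from the page dimensions.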