I have a folder of about 2000 .pdf files containing laboratory results. All files are in a similar format and layout.
I have been trying to read all of the .pdf files into R and then extract data from the relevant fields for analysis, but I am really struggling with the regex.
Here is an example:
library(tidyverse)
library(pdftools)
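# read in all the .pdf files in the working directory
file.list <- list.files(pattern = "\\.pdf$")
# pdf_text() returns one string per page, so each element is a character vector
x <- map(file.list, ~ pdf_text(.x))
names(x) <- gsub("\\.pdf$", "", file.list)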
Created on 2019-03-26 by the reprex package (v0.2.1)
The imported .pdf files basically look like this (but much longer in reality)
x <- list(file1 = "OTHER TEXT\nSample ID: PRO22884Z- OTHER TEXT \nTest Result: NOT DETECTED\n OTHER TEXT End Time: 21/01/19 17:21:10\n OTHER TEXT ",
file2 = "OTHER TEXT\nSample ID: PRO33443M- OTHER TEXT \nTest Result: DETECTED\n OTHER TEXT End Time: 22/01/19 18:04:34\n OTHER TEXT ",
file3 = "OTHER TEXT\nSample ID: PRO112236- OTHER TEXT \nTest Result: DETECTED\n OTHER TEXT End Time: 14/02/19 09:34:17\n OTHER TEXT ")
Now, I guess I need to map through each item in the list and extract the fields I need, but this is where I am getting lost in regex symbols.
The end result should look like this:
library(tidyverse)
output <- tribble(
~sample_id, ~end_time, ~test_result,
"PRO22884Z", "21/01/19 17:21:10", "NOT DETECTED",
"PRO33443M", "22/01/19 18:04:34", "DETECTED",
"PRO112236", "14/02/19 09:34:17", "DETECTED"
)
output
#> # A tibble: 3 x 3
#>   sample_id end_time          test_result
#>   <chr>     <chr>             <chr>
#> 1 PRO22884Z 21/01/19 17:21:10 NOT DETECTED
#> 2 PRO33443M 22/01/19 18:04:34 DETECTED
#> 3 PRO112236 14/02/19 09:34:17 DETECTED
Any help gratefully received.
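One possible way to do the extraction step described above is sketched here with stringr. It is only a sketch under some assumptions: every file contains exactly one "Sample ID:", "Test Result:" and "End Time:" label, the sample ID is made up of letters and digits only, the test result runs to the end of its line, and the timestamp always has the dd/mm/yy hh:mm:ss shape shown in the example. The helper name `extract_field` is made up for illustration and is not part of any package.

```r
library(stringr)
library(tibble)

# Example input copied from the question: one string per file
x <- list(
  file1 = "OTHER TEXT\nSample ID: PRO22884Z- OTHER TEXT \nTest Result: NOT DETECTED\n OTHER TEXT End Time: 21/01/19 17:21:10\n OTHER TEXT ",
  file2 = "OTHER TEXT\nSample ID: PRO33443M- OTHER TEXT \nTest Result: DETECTED\n OTHER TEXT End Time: 22/01/19 18:04:34\n OTHER TEXT ",
  file3 = "OTHER TEXT\nSample ID: PRO112236- OTHER TEXT \nTest Result: DETECTED\n OTHER TEXT End Time: 14/02/19 09:34:17\n OTHER TEXT "
)

# pdf_text() returns one string per page, so collapse each file
# into a single string before matching (a no-op for the example above)
txt <- vapply(x, paste, character(1), collapse = "\n")

# Hypothetical helper: return the first capture group of `pattern`
# for each string, with surrounding whitespace trimmed (NA if no match)
extract_field <- function(text, pattern) {
  str_trim(str_match(text, pattern)[, 2])
}

output <- tibble(
  # assumption: IDs contain only letters and digits
  sample_id   = extract_field(txt, "Sample ID:\\s*([A-Za-z0-9]+)"),
  # assumption: timestamps look like dd/mm/yy hh:mm:ss
  end_time    = extract_field(txt, "End Time:\\s*(\\d{2}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2})"),
  # assumption: the result is everything up to the end of the line
  test_result = extract_field(txt, "Test Result:\\s*([^\\n]+)")
)

output
```

Because `str_match()` is vectorised over its input, no explicit `map()` over the list is needed once each file has been collapsed to a single string. Files where a label is missing would come back as `NA` in that column, which may be a convenient way to spot reports that deviate from the expected layout.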