This follows on from my previous question.
I was trying to read from a PDF file and create a data frame with specific fields from the file.
I have 1000+ reports generated weekly. This is how far I have gotten.
I am reading the PDF file using:
text = extract_text(file = "C:/Work/R/text extraction/Extration_tests/WO-09017974A.pdf",
pages = c(1,6)) ## The two pages with the data of importance
and then putting it into a text file
sink("extract_text.txt")
cat(text, sep = "\n") ## "\n" is the newline escape; "/n" would be written literally
sink()
Next, I read this text file (output pasted below) into a character vector, one element per line:
ICRE = readLines("extract_text.txt")
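As a possible alternative to the intermediate text file, the extracted string could be split into lines directly. This is only a sketch: the `text` value below is a hypothetical stand-in for whatever `extract_text()` returns, and it assumes the lines are separated by `\n` or `\r\n`.

```r
# Hypothetical stand-in for the string returned by extract_text()
text <- "Internal\nRepair Report\n10/20/2021\nsite1 - Site"

# Split on line breaks (handles both \n and \r\n) to get one element per line
ICRE <- unlist(strsplit(text, "\r?\n"))

print(ICRE[3])
```

Note that `strsplit()` drops a trailing empty line, so the indices may not match those from `readLines()` exactly.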
and creating an output data frame with the fields I want:
extracted = data.frame("date" = ICRE[[6]],
                       "WO" = ICRE[[10]],
                       "Incident description" = paste(ICRE[[42]], ICRE[[43]]), ## using paste() because the related data spans two lines
                       "Impact on customer" = ICRE[[45]],
                       "Condition of system" = ICRE[[47]])
Note: This is just a sample dataframe and I will be adding more fields from the text file.
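For the WO field, which currently receives the whole reference line, one approach I am considering is a regular expression. The sketch below uses base R only and assumes the work-order number always has the form `WO-` followed by digits, which I have only checked against this one sample line.

```r
# Reference line copied from the sample text output
ref_line <- "Our reference: 0 / WO-09017974 Customer reference:"

# Assumed pattern: "WO-" followed by one or more digits
wo <- regmatches(ref_line, regexpr("WO-[0-9]+", ref_line))

print(wo)
```

If a line contains no match, `regmatches()` returns a zero-length vector, so that case would need handling before building the data frame.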
This is how the text file I am using to extract the above data looks.
Text file from output (data has been anonymized for some entries)
Internal
Repair Report
10/20/2021
site1 - Site
Our reference: 0 / WO-09017974 Customer reference:
Report prepared by ABC Customer contact: DEF
(Lines 11-38 of the report can be skipped and have been deleted here.)
NOTE: All bold text in the report is a column header (sometimes ending with a period); non-bold text is the row value (in some cases it follows the period on the same line).
Incident Description (line 39 in the report)_
Incident description:
Troubleshoot the issue with
ABC and BBB
Impact on customer. No disturbance
Condition of system upon arrival. ALL is OK
_
Investigation and Analysis
_
Action to protect the block.
It was discussed to not transfer the block since the other block is
already in PPP.Other
Circumstance of the fault. No particular circumstance
Premises and environment visual check.
The premises are clean / The
premises are well-ventilated
I have a few questions:
- Is this the best way to do this?
- I would like to automate this process for the 1000+ files coming in every week.
- The primary key for the data is the work-order number (WO-09017974 in this file), which is the first entry in the dataset. As of now, the data frame gets the whole line "Our reference: 0 / WO-09017974 Customer reference:", but I only need WO-09017974. How can I extract just that?
- Is there a way to automate the extraction using key phrases instead of page numbers? Right now it is based on page numbers:
text = extract_text(file = "C:/Work/R/text extraction/Extration_tests/WO-09017974A.pdf", pages = c(1,6))
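To illustrate what I mean by key phrases, here is a rough sketch in base R that looks fields up by their label rather than by line number. The helper `value_after()` is a hypothetical name of my own, the labels are taken from the sample output above, and it assumes each label appears exactly as shown and that the incident description always sits between "Incident description:" and "Impact on customer.".

```r
# Sample lines copied from the text output above
lines <- c(
  "Our reference: 0 / WO-09017974 Customer reference:",
  "Incident description:",
  "Troubleshoot the issue with",
  "ABC and BBB",
  "Impact on customer. No disturbance",
  "Condition of system upon arrival. ALL is OK"
)

# Hypothetical helper: return the text that follows `label` on the first
# line containing it, or NA if the label is not found
value_after <- function(lines, label) {
  hit <- grep(label, lines, fixed = TRUE)
  if (length(hit) == 0) return(NA_character_)
  line <- lines[hit[1]]
  start <- as.integer(regexpr(label, line, fixed = TRUE)) + nchar(label)
  trimws(substring(line, start))
}

impact <- value_after(lines, "Impact on customer.")
condition <- value_after(lines, "Condition of system upon arrival.")

# The incident description spans the lines between two known labels
from <- grep("Incident description:", lines, fixed = TRUE)[1]
to <- grep("Impact on customer.", lines, fixed = TRUE)[1]
incident <- paste(lines[(from + 1):(to - 1)], collapse = " ")

print(c(impact, condition, incident))
```

For the weekly batch, I imagine wrapping this in a function and applying it to every path returned by `list.files()` with `lapply()`, but I have not tested that on the real reports.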
Sorry for the overly long question and thanks in advance for any help.