Reading text from pdf files with proper format

YanjiD · March 13, 2020, 6:35pm

I am doing text mining on a set of PDF files and first have to convert them to txt format. Reading the PDF files is not a problem, but I realize that the generated txt files preserve the same layout as the PDFs, so many of them have multiple columns on one page. That is fine for reading but not for analysis, because the software assumes that all the content on one page belongs to the same paragraph. I have uploaded a picture as an example: there are two paragraphs on the same page, but the software does not identify them correctly and instead interprets them as one paragraph.

So I am wondering whether there is any way (when I read the PDFs in R) to properly format the content before converting it to txt. I have searched extensively but cannot find a solution. I would be grateful for any insight or suggestion, and thank you very much in advance!

dromano · March 13, 2020, 6:56pm

Hi @YanjiD, what tools do you use to read pdfs in R?

YanjiD · March 13, 2020, 7:04pm

Hello @dromano, I use the pdf_text() function from the pdftools package in R to read the PDFs.

dromano · March 13, 2020, 8:09pm

Thanks, @YanjiD. Is the image you posted an image of the output of pdf_text(), or is it an image of the original pdf?

YanjiD · March 13, 2020, 9:14pm

It is a screenshot of the txt file generated by pdf_text(), but I believe the original PDF file has the same layout (two paragraphs on that page).

dromano · March 13, 2020, 9:22pm

Could you post the text here, between a pair of triple backticks (```), like this?

```
<-- paste here
```

That might be a place to start, and if you could make the pdf available somehow, folks could try to help directly with that, too.
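
In the meantime, here is a rough sketch of one possible way to untangle a two-column page, assuming the text returned by pdf_text() keeps the columns aligned with spaces and that a band of blank character positions (a "gutter") runs down every line of the page. The function name `split_two_columns` and the `min_gap` parameter are made up for illustration; they are not part of pdftools, and the idea has not been tried on the PDFs in question.

```r
# Hypothetical helper: split one page of layout-preserving text into
# left-column lines followed by right-column lines.
split_two_columns <- function(page_text, min_gap = 3) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  if (length(lines) == 0) return(character(0))
  width <- max(nchar(lines))
  if (width == 0) return(character(0))

  # Pad every line to the same width so character positions line up
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  chars <- do.call(rbind, strsplit(padded, ""))

  # A character position is "blank" if it holds a space on every line
  blank <- colSums(chars != " ") == 0
  runs <- rle(blank)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  # Candidate gutters: interior runs of blank positions, at least min_gap wide
  is_gutter <- runs$values & starts > 1 & ends < width &
    runs$lengths >= min_gap
  if (!any(is_gutter)) {
    out <- trimws(lines)            # no gutter found: treat as one column
    return(out[nzchar(out)])
  }

  # Use the widest candidate as the column separator
  best <- which(is_gutter)[which.max(runs$lengths[is_gutter])]
  left <- trimws(substr(padded, 1, starts[best] - 1))
  right <- trimws(substr(padded, ends[best] + 1, width))
  out <- c(left, right)
  out[nzchar(out)]                  # drop lines that are empty after trimming
}

# Toy input standing in for one element of pdf_text()'s output
page <- paste("Alpha one     Beta one",
              "Alpha two     Beta two",
              sep = "\n")
print(split_two_columns(page))

# With a real file this might look like ("paper.pdf" is a placeholder path):
# pages <- pdftools::pdf_text("paper.pdf")
# cleaned <- lapply(pages, split_two_columns)
```

This will break on pages where a title or figure caption spans both columns, since no position is then blank on every line. For those cases the word-level bounding boxes from pdftools::pdf_data() may be a better starting point.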
