Categorizing and quickly searching a large number of PDF files

Mehrdad · January 7, 2022, 7:48am

Hi Friends,

I have 50,000+ PDF files, and the collection grows by 10-100 files a day. I want to find the files on disk, preprocess them, cluster them, and make searching them faster.

I can do it somehow, although it would be a great help to have your opinion about the best approach.

To categorize the tasks, here is what comes to my mind.

Data Extraction
Clustering the Documents for better search
Searching
Keeping the list of files up to date as they change on disk

Thanks, everyone, for your feedback. Any input on the above tasks will help a lot.

Kind regards,

Mehrdad

mattwarkentin · January 7, 2022, 3:53pm

I think the R package fs can be very helpful for file searching and the pdftools package for loading data from PDFs and searching them. Hope this helps.
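To make the file-tracking part concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a tested recipe for this collection: the manifest file and the function names (scan_pdfs, diff_snapshots, update_manifest) are hypothetical, and it assumes that a change in file size or modification time is a good enough signal that a PDF needs reprocessing. The same idea should carry over to R, where fs::dir_info() reports size and modification time.

```python
import json
import os
from pathlib import Path


def scan_pdfs(root):
    """Return {relative_path: [size, mtime_ns]} for every PDF under root."""
    snapshot = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                full = Path(dirpath) / name
                stat = full.stat()
                key = str(full.relative_to(root))
                snapshot[key] = [stat.st_size, stat.st_mtime_ns]
    return snapshot


def diff_snapshots(old, new):
    """Classify paths as added, changed, or removed between two scans."""
    return {
        "added": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        "removed": sorted(p for p in old if p not in new),
    }


def update_manifest(root, manifest_path):
    """Rescan root, diff against the saved manifest, then save the new scan."""
    manifest_path = Path(manifest_path)
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = scan_pdfs(root)
    manifest_path.write_text(json.dumps(new))
    return diff_snapshots(old, new)
```

Run once a day, update_manifest() returns only the 10-100 new or modified files, so text extraction (for example with pdftools) only has to touch those rather than all 50,000+.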

Mehrdad · January 7, 2022, 8:28pm

Thanks for your message, Matt. One thing I am thinking about is how to save the data. Should I save the preprocessed data in a database, or will an RDS object be enough?
The other point is differentiating the documents so that the user can find what they want with the least effort. For example, a keyword might be a word that matters to the business, such as "bridge" or "road", or it might be a word that is not important to the business even though it is the best way for an algorithm we develop to tell the documents apart. Maybe I should build it first and not worry about this at all. I think the way we search for documents might differ from what works best for an algorithm such as a random forest. What do you think?
Maybe I should learn from the questions users ask, and fine-tune or build the search system based on each user's query history.
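One common way to find the words that set a document apart is TF-IDF weighting. This is a swapped-in technique for illustration, not something anyone in this thread has settled on. The rough sketch below is plain Python; the function names and the tiny example corpus in the note after it are made up. It scores a term higher when it is frequent in one document and rare across the collection, so a word that appears everywhere scores zero.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_n=3):
    """docs maps a document name to its text.

    Returns {name: [terms]} with the top_n terms per document,
    ranked by TF-IDF (ties broken alphabetically).
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    n_docs = len(docs)
    result = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        scores = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in term_counts.items()
        }
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:top_n]
    return result
```

With three toy documents "bridge bridge report", "road road report" and "tunnel report report", the top keyword per document comes out as "bridge", "road" and "tunnel", while "report" scores zero because every document contains it. Whether the top-scoring words are also the ones that matter to the business is exactly the open question above, so comparing this output against real user queries seems like a sensible first check.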

Thanks for your attention,

Mehrdad

xvalda · January 7, 2022, 9:09pm

Hi @Mehrdad , I will soon need a similar approach as I'm in the process of developing Shiny apps for legal tech clients. So I'm interested to know which option you end up choosing.

I'd like to avoid the heavy artillery that one of my clients uses:

connectors to fetch and discover metadata from different systems (not needed here though, I guess there will be only one source of data)
converters with OCR option to extract the actual raw text (quite easy here since we're dealing only with PDFs)
all saved in a Postgres database
index builder to extract entities, apply rules, ...
index builder pushes data to SOLR
REST service and search GUI
note: there's a data analytics API that connects to both Postgres and SOLR to report search statistics from users

But that's a whole lot (and actually only part of the architecture) and I hope there's an easier way. I'll get back to you if/when I find something that matches the type of deployment you're facing.
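As a rough idea of what a much lighter setup might look like, the sketch below keeps the extracted text and a small inverted index in a single SQLite file, standing in for the Postgres plus SOLR pair above. It is Python standard library only, the table and function names are hypothetical, and it assumes the text has already been extracted from the PDFs. It does plain AND keyword matching ranked by term frequency, nowhere near what a real search engine offers.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def open_index(db_path=":memory:"):
    """Open (or create) the document store and its inverted index."""
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS docs (
            id INTEGER PRIMARY KEY, path TEXT UNIQUE, body TEXT);
        CREATE TABLE IF NOT EXISTS postings (
            term TEXT, doc_id INTEGER, freq INTEGER,
            PRIMARY KEY (term, doc_id));
    """)
    return con


def add_document(con, path, body):
    """Insert a document, or re-index it if the path is already known."""
    row = con.execute("SELECT id FROM docs WHERE path = ?", (path,)).fetchone()
    if row:
        doc_id = row[0]
        con.execute("DELETE FROM postings WHERE doc_id = ?", (doc_id,))
        con.execute("UPDATE docs SET body = ? WHERE id = ?", (body, doc_id))
    else:
        cur = con.execute(
            "INSERT INTO docs (path, body) VALUES (?, ?)", (path, body))
        doc_id = cur.lastrowid
    counts = Counter(tokenize(body))
    con.executemany(
        "INSERT INTO postings VALUES (?, ?, ?)",
        [(term, doc_id, n) for term, n in counts.items()])
    con.commit()


def search(con, query):
    """Return paths of documents containing every query term,
    most frequent matches first (ties broken by path)."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    rows = con.execute(
        f"SELECT d.path FROM postings p JOIN docs d ON d.id = p.doc_id "
        f"WHERE p.term IN ({marks}) GROUP BY d.id "
        f"HAVING COUNT(*) = ? ORDER BY SUM(p.freq) DESC, d.path",
        (*terms, len(terms))).fetchall()
    return [r[0] for r in rows]
```

Calling add_document() again for a path that already exists replaces its old postings, which suits a collection where files change on disk. If the search needs outgrow this (stemming, phrase queries, faceting), that is probably the point where SOLR or a similar engine earns its place.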

Mehrdad · January 7, 2022, 9:39pm

Hi @xvalda,
Thanks for your reply. It was a great help. I hadn't heard of SOLR and had never thought of converters (OCR) for PDF documents. That will be interesting. If you know a library that could help, please let me know.
BTW, where are you? Are you available to meet up and chat in more detail? Perhaps we can join forces, create a product together, and help each other even on the sales side.
I am in Canada and I would be happy to connect. If it helps, please choose a date/time from my contact page.

Thanks for your help.

kind regards,

Mehrdad

xvalda · January 7, 2022, 9:47pm

Hello again @Mehrdad ,
Happy to connect and discuss in more detail; I'll reach out.
For OCR you can use Tesseract, which is not too hard to use (with basic configuration).

Here's a concrete example

All the best,
Xavier

Mehrdad · January 7, 2022, 10:21pm

Perfect! Thank you @xvalda !
