I think the R package fs can be very helpful for file searching and the pdftools package for loading data from PDFs and searching them. Hope this helps.
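As a minimal sketch of that idea, assuming the PDFs live under a single folder: `fs::dir_ls()` lists the files and `pdftools::pdf_text()` returns one string per page, which can then be matched against a keyword. The helper names `find_pages` and `search_pdfs` are made up for illustration, and the matching is a plain case-insensitive substring test, not a real search engine.

```r
# Return the page numbers whose text contains `keyword`
# (case-insensitive, literal match). `pages` holds one string per page.
find_pages <- function(pages, keyword) {
  which(grepl(tolower(keyword), tolower(pages), fixed = TRUE))
}

# Hypothetical helper: scan every PDF under `dir` and keep the files
# with at least one matching page. Requires the fs and pdftools packages.
search_pdfs <- function(dir, keyword) {
  files <- as.character(fs::dir_ls(dir, recurse = TRUE, glob = "*.pdf"))
  hits <- lapply(files, function(f) find_pages(pdftools::pdf_text(f), keyword))
  names(hits) <- files
  hits[lengths(hits) > 0]
}

# The page matcher works on any character vector, so it can be tried
# without a PDF at hand.
example_pages <- c("The Bridge plan", "nothing here", "road and bridge")
example_hits <- find_pages(example_pages, "bridge")
```

Calling `search_pdfs("docs", "bridge")`, where `docs` is a placeholder folder name, would return a named list mapping each matching file to its page numbers.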
Thanks for your message, Matt. One thing I am thinking about is how to save the data. Should I save the preprocessed data in a database, or will an RDS object be enough?
The other point is how to differentiate the documents so that users can find what they want with the least effort. For example, a keyword might be a business-related word such as "bridge" or "road", or it might be a word that is not important to the business, even though it is the best way for an algorithm we develop to tell the documents apart. Maybe I should build it first and not worry about this yet. I think the way we search documents might differ from what, say, a random forest approach would consider best. What do you think?
Maybe I should learn from the questions users ask, either to fine-tune the search or to build a system that searches based on each user's historical questions.
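On the storage question, here is a minimal sketch of the RDS option, assuming the preprocessed text fits in memory as one data frame with one row per page. The column names and the `save_index` / `load_index` helpers are made up for illustration, and the sample rows are placeholder data.

```r
# Hypothetical helpers around base R serialization.
save_index <- function(index, path) {
  saveRDS(index, path)
  invisible(path)
}
load_index <- function(path) readRDS(path)

# Placeholder index: one row per extracted page.
index <- data.frame(
  file = c("a.pdf", "a.pdf", "b.pdf"),
  page = c(1L, 2L, 1L),
  text = c("bridge inspection report", "appendix", "road maintenance plan"),
  stringsAsFactors = FALSE
)

# Write the index once, then reload it (for example at app start-up).
path <- tempfile(fileext = ".rds")
save_index(index, path)
restored <- load_index(path)

# Simple keyword lookup on the reloaded index.
hits <- restored[grepl("road", restored$text, fixed = TRUE), c("file", "page")]
```

An RDS file is read and written as a whole, so a database probably only starts to pay off once the index no longer fits in memory or several processes need to update it.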
Hi @Mehrdad, I will soon need a similar approach, as I'm in the process of developing Shiny apps for legal tech clients. So I'm interested to know which option you end up choosing.
I'd like to avoid the heavy artillery that one of my clients uses:
connectors to different systems to fetch data and discover metadata (not needed here though, I guess there will be only one source of data)
converters with OCR option to extract the actual raw text (quite easy here since we're dealing only with PDFs)
all saved in a Postgres database
index builder to extract entities, apply rules, ...
index builder pushes data to SOLR
REST service and search GUI
note: there's a data analytics API that connects to both Postgres and SOLR to report search statistics from users
But that's a whole lot (and actually only part of the architecture) and I hope there's an easier way. I'll get back to you if/when I find something that matches the type of deployment you're facing.
Hi @xvalda,
Thanks for your reply. It was a great help. I hadn't heard of SOLR and had never thought of converters (OCR) for PDF documents. That will be interesting. If you know a library that could help, please let me know.
BTW, where are you? Are you available to meet up and chat in more detail? Perhaps we can join forces, create a product together, and even help each other on the sales side.
I am in Canada and I would be happy to connect. If it helps, please choose a date/time from my contact page.
Hello again @Mehrdad,
Happy to connect and discuss in more detail; I'll reach out.
For OCR you can use Tesseract, which is not too hard to use with a basic configuration.
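A minimal sketch of how that could look from R, assuming the `tesseract` and `pdftools` packages and the English language data are installed. The helper names `needs_ocr` and `ocr_pdf`, the 20-character threshold, and the 300 dpi setting are illustrative choices, not recommendations.

```r
# Scanned pages usually come back (nearly) empty from pdftools::pdf_text(),
# so a short extracted text is used here as a hint that OCR is needed.
needs_ocr <- function(page_text, min_chars = 20) {
  nchar(trimws(page_text)) < min_chars
}

# Hypothetical helper: render each page to PNG, then run Tesseract on it.
ocr_pdf <- function(pdf, language = "eng", dpi = 300) {
  engine <- tesseract::tesseract(language)
  images <- pdftools::pdf_convert(pdf, format = "png", dpi = dpi)
  on.exit(unlink(images))  # remove the temporary page images afterwards
  vapply(
    images,
    function(img) tesseract::ocr(img, engine = engine),
    character(1),
    USE.NAMES = FALSE
  )
}

# The check itself is plain base R and can be tried on any strings.
flags <- needs_ocr(c("", "  \n ", "A normal page with plenty of extracted text."))
```

Running OCR only on the pages flagged by `needs_ocr()` keeps the slower Tesseract step away from PDFs that already contain a text layer.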