Text Retrieval from an Image

Hello All,
I am trying to read a .PNG image and need to extract the text from it.
I used the tesseract and magick packages, but the output is weird. Please find the source and the output below, and let me know what should be done to fix it.

NOTE: I'm a rookie in R programming.

Source:
C:\Users\Prasanna.Mathivanan1>echo PRASANNA,Mani& echo.ALJHU,Bala& echo.SUDJOS,SU& echo.SUDJOS1,SU
PRASANNA,Mani
ALJHU,Bala
SUDJOS,SU
SUDJOS1,SU

Output:

image_read("C:/Users/Prasanna.Mathivanan1/Desktop/Image_Processing/git.PNG") %>%
+ image_resize("1000x") %>%
+ image_convert(type = 'grayscale') %>%
+ image_ocr() %>%
+ cat()
C:\Users\Prasanna.Mathivanan1>echo PRASANNA,Mani& echo.ALJHU,Bala& echo.SUDJOS,SU& echo.SUDJOS1,SU
CUCU WEEE

Nia)

UPL ee

ese)

tesseract::ocr("C:/Users/Prasanna.Mathivanan1/Desktop/Image_Processing/git.PNG", engine = tesseract("eng"), HOCR = FALSE)
[1] "ANU NT UCB PRU es ele re Ea Lee UCB) a Led ELLOS Lose ee\nAOU MICuEe\n\nALIJHU, Bala\n\nTee]\n\nTse ee)\n"

Maybe the images are too poor in quality to get a good OCR result from them? Can you share the PNG?

I converted the image to black and white and then inverted it.
Do you get any different performance on it?
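
For reference, a black-and-white conversion plus inversion like the one described here can be sketched with the magick package. This is only a sketch: the helper name `preprocess_for_ocr`, the 50% threshold, and the "2000x" upscaling are assumptions to tune per image, not the exact steps applied to the shared file.

```r
library(magick)

# Hypothetical helper: turn a light-on-dark terminal screenshot into
# dark-on-light, pure black-and-white text before OCR.
preprocess_for_ocr <- function(path, width = "2000x", threshold = "50%") {
  img <- image_read(path)
  img <- image_resize(img, width)                # upscale so glyphs have more pixels
  img <- image_convert(img, type = "Grayscale")  # drop color information
  img <- image_negate(img)                       # invert: white-on-black becomes black-on-white
  # Push every pixel to pure white or pure black around the threshold
  img <- image_threshold(img, type = "white", threshold = threshold)
  img <- image_threshold(img, type = "black", threshold = threshold)
  img
}

# Usage (placeholder path):
# cat(image_ocr(preprocess_for_ocr("git.PNG")))
```

Upscaling may help thin strokes such as the hook of a "J" survive thresholding, but whether it does depends on the screenshot's font and resolution.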

Please find the output for the inverted image you shared.

In the source we have "ALJHU", whereas in the output it is "ALIHU".
J is reported as I.

Could anyone please suggest a solution to fix the above issue?

OCR is always going to be somewhat prone to errors, but the tesseract library and its R bindings provide various methods you can try to improve your results.

See the "Preprocessing with Magick" and "Tesseract Control Parameters" sections of the tesseract vignette, "Using the Tesseract OCR engine in R".

The latter section will also get you to the Tesseract documentation on "Improving the quality of the output".
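
As a rough illustration of the control parameters that section covers, here is a minimal sketch that restricts recognition to characters expected in a terminal screenshot. The character set and the names `terminal_chars`, `terminal_engine`, and `ocr_terminal` are assumptions for illustration. A whitelist stops stray symbols from appearing, but it will not by itself separate look-alike letters such as J and I, and some Tesseract 4.0 builds reportedly ignore it with the LSTM engine.

```r
library(tesseract)

# Assumption: the screenshots only contain these characters; adjust as needed.
terminal_chars <- paste0(
  paste(LETTERS, collapse = ""),
  paste(letters, collapse = ""),
  "0123456789",
  ",.:\\/>&_-"
)

# Engine restricted to the expected character set
terminal_engine <- tesseract(
  language = "eng",
  options = list(tessedit_char_whitelist = terminal_chars)
)

# Hypothetical wrapper around tesseract::ocr() for a screenshot path
ocr_terminal <- function(path, engine = terminal_engine) {
  ocr(path, engine = engine)
}

# Usage (placeholder path):
# cat(ocr_terminal("git.PNG"))
```

`tesseract_params()` lists the other parameters the installed Tesseract build accepts, which is a reasonable starting point for further tuning.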

I tried the things in "Using the Tesseract OCR engine in R", but I still faced issues.
:sob:
Maybe I need to drill down more.

Maybe it's worth backing up a step.
Must you create a solution involving OCR?
If you are capturing the content of a DOS terminal, there are ways to capture it that don't involve imagery. Perhaps you can describe your underlying use case and requirements so we can see if there is something more effective than relying on OCR?

My requirement is that the client provides a screenshot, and from it I need to retrieve the text available in the image.
The client would provide the image from a DOS terminal, from PuTTY, or from the output produced by running a script.
That's why I opted for OCR to fulfill my requirement.

I am open to any suggestions.

Is this about logging the client's activity? There would be better ways to transmit a log of an output than OCR.
When a script is run from the command line, you can always redirect its output to a file; emailing that file will perfectly reproduce the script's output without the need for reading an image :slight_smile:
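
To illustrate the point about sending a script's output to a file: assuming the client runs something like `myscript.bat > output.txt` (both names are placeholders), the result is plain text that R reads exactly, with no OCR step. The sketch below simulates such a file using the lines from the original post; the column names `id` and `name` are made up for the example.

```r
# Simulate the file a client would produce with `myscript.bat > output.txt`
# (both names are placeholders, not taken from the thread).
log_path <- tempfile(fileext = ".txt")
writeLines(c("PRASANNA,Mani", "ALJHU,Bala", "SUDJOS,SU", "SUDJOS1,SU"), log_path)

# Reading the text back is exact, so no J/I confusion is possible.
log_lines <- readLines(log_path)

# The comma-separated lines can go straight into a data frame.
log_table <- read.csv(text = log_lines, header = FALSE,
                      col.names = c("id", "name"))
print(log_table)
```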

The client logs his activity and sends the picture.
We are supposed to retrieve the text information from the picture and add it to our log document.

I'm sorry to say, this approach sounds like a mistake. It's very common for people to run a process, log the result, send the outcome, send the log. It is not standard practice to use images and OCR for this. Please consider alternatives, for your own sake.

The client is not going to change his approach.
They will provide the image from a DOS terminal; I need to retrieve the text from the image and provide it to another client.
My requirement has to be fulfilled, so I am open to anything.

In that case, I think you're going to need to take a deep dive into the tesseract documentation I linked to earlier. The R package is just a binding to the tesseract library, so I'd go right to the source.

You might look into customizing your configuration to deal with the specific input. I've never done this, but it's the

There's also a tesseract forum, but I think there are quite a few worthwhile approaches described in the manual to review before heading there.
https://groups.google.com/forum/?fromgroups#!forum/tesseract-ocr
