Text Retrieval from an Image

maniya90 · March 12, 2020, 5:24pm

Hello all,
I am trying to read a .PNG image and extract the text from it.
I used the tesseract and magick packages, but the output is garbled. Please find the source and the output below, and let me know what should be done to fix it.

NOTE: I'm a rookie in R programming.

Source:
C:\Users\Prasanna.Mathivanan1>echo PRASANNA,Mani& echo.ALJHU,Bala& echo.SUDJOS,SU& echo.SUDJOS1,SU
PRASANNA,Mani
ALJHU,Bala
SUDJOS,SU
SUDJOS1,SU

Output:

image_read("C:/Users/Prasanna.Mathivanan1/Desktop/Image_Processing/git.PNG") %>%
  image_resize("1000x") %>%
  image_convert(type = 'grayscale') %>%
  image_ocr() %>%
  cat()
C:\Users\Prasanna.Mathivanan1>echo PRASANNA,Mani& echo.ALJHU,Bala& echo.SUDJOS,SU& echo.SUDJOS1,SU
CUCU WEEE

Nia)

UPL ee

ese)

tesseract::ocr("C:/Users/Prasanna.Mathivanan1/Desktop/Image_Processing/git.PNG", engine = tesseract("eng"), HOCR = FALSE)
[1] "ANU NT UCB PRU es ele re Ea Lee UCB) a Led ELLOS Lose ee\nAOU MICuEe\n\nALIJHU, Bala\n\nTee]\n\nTse ee)\n"

nirgrahamuk · March 12, 2020, 6:30pm

Maybe the image quality is too poor to get a good OCR result? Can you share the PNG?

maniya90 · March 12, 2020, 7:40pm

nirgrahamuk · March 14, 2020, 12:40am

I converted the image to black and white and then inverted it.
Do you get any different performance on it?
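
A minimal sketch of this kind of preprocessing with magick, assuming the screenshot shows light text on a dark terminal background. The image below is a synthetic stand-in so the snippet runs on its own; in practice `image_read()` on the real PNG would replace it. The sizes, offsets, and resize width are guesses to tune, not recommended values.

```r
library(magick)
library(tesseract)

# Synthetic stand-in for the terminal screenshot (assumption: white text on black).
lines <- c("PRASANNA,Mani", "ALJHU,Bala", "SUDJOS,SU", "SUDJOS1,SU")
img <- image_blank(width = 600, height = 160, color = "black")
for (i in seq_along(lines)) {
  img <- image_annotate(img, lines[i], size = 28, color = "white",
                        location = sprintf("+10+%d", 10 + (i - 1) * 35))
}

# Convert to grayscale, invert to dark-on-light, and upscale before OCR.
prepared <- image_convert(img, type = "Grayscale")
prepared <- image_negate(prepared)
prepared <- image_resize(prepared, "2000x")

# Hand the image to tesseract as an in-memory PNG.
png_raw <- image_write(prepared, format = "png", density = "300x300")
text <- ocr(png_raw, engine = tesseract("eng"))
cat(text)
```

Whether inversion helps depends on the input; comparing the OCR text with and without `image_negate()` on the real screenshot is the quickest way to find out.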

maniya90 · March 14, 2020, 4:15am

Please find the output for the inverted image you shared.

In the source we have "ALJHU", whereas in the output it is "ALIHU".
J is reported as I.
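
One way to spot single-character slips like this is to look at word-level confidence. `tesseract::ocr_data()` returns one row per recognised word with a confidence score. The sketch below uses a synthetic image as a stand-in for the screenshot, and the cutoff of 80 is an arbitrary assumption, not a recommended value.

```r
library(magick)
library(tesseract)

# Synthetic stand-in for the inverted screenshot (assumption: black text on white).
img <- image_blank(width = 600, height = 80, color = "white")
img <- image_annotate(img, "ALJHU,Bala", size = 36, color = "black",
                      location = "+10+15")

# Write to a temporary PNG and read word-level results.
tmp <- tempfile(fileext = ".png")
image_write(img, path = tmp, format = "png")
words <- ocr_data(tmp, engine = tesseract("eng"))
print(words)

# Flag words the engine itself is unsure about (the cutoff is a guess).
low_conf <- words[words$confidence < 80, ]
print(low_conf)
```

Low-confidence words are candidates for manual review. A confidently wrong "I" for "J" would still slip through, so this narrows the checking rather than replacing it.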

maniya90 · March 17, 2020, 12:39pm

Could anyone please provide a solution to fix the above issue?

mara · March 17, 2020, 2:18pm

OCR is always going to be somewhat prone to errors, but the tesseract library and its R bindings provide various methods you can try to improve your results.

See the Preprocessing with Magick and Tesseract Control Parameters sections of the tesseract vignette: Using the Tesseract OCR engine in R

The latter section will also get you to the Tesseract documentation on "Improving the quality of the output".
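
As a rough sketch of what setting control parameters looks like from R: `tesseract()` takes an `options` list, and `tesseract_params()` lists the available parameter names. The two parameters chosen here, and the idea that they suit a terminal screenshot, are assumptions to test rather than a known fix. Depending on the Tesseract version, the whitelist may be ignored by the LSTM engine.

```r
library(tesseract)

# Browse the available parameters by name fragment.
print(tesseract_params("pageseg"))
print(tesseract_params("whitelist"))

# Characters expected in the names list (assumption: letters, digits, comma, period).
allowed <- paste0(c(LETTERS, letters, 0:9, ",", "."), collapse = "")

engine <- tesseract(
  language = "eng",
  options = list(
    tessedit_pageseg_mode = 6,          # treat the page as one uniform block of text
    tessedit_char_whitelist = allowed   # restrict output to the expected characters
  )
)

# Usage with a hypothetical file path:
# text <- ocr("path/to/screenshot.png", engine = engine)
```

A whitelist this narrow would drop prompt characters such as `\` and `>`, so it only makes sense once the image has been cropped to the lines of interest.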

maniya90 · March 17, 2020, 2:51pm

I tried the suggestions in Using the Tesseract OCR engine in R, but I still faced issues.

I might need to drill down more.

nirgrahamuk · March 17, 2020, 3:08pm

Maybe it's worth backing up a step.
Must you create a solution involving OCR?
If you are capturing the content of a DOS terminal, there are ways to capture it that don't involve imagery. Perhaps you can describe your underlying use case and requirements so we can see if there is something more effective than relying on OCR?

maniya90 · March 17, 2020, 3:29pm

My requirement is that the client will provide a screenshot, and I need to retrieve the text from that image.
The client will provide the image from a DOS terminal, from PuTTY, or of the output produced by executing a script.
That's why I opted for OCR to fulfill my requirement.

I am open to any suggestions.

nirgrahamuk · March 17, 2020, 3:51pm

Is this about logging the client's activity? There are better ways to transmit a log of an output than OCR.
When a script is run from the command line, you can always redirect its output to a file; emailing that file will perfectly reproduce the script's output without the need to read an image.
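
As a sketch of that idea from within R: `system2()` can send a command's standard output straight to a file. The one-line script below is a hypothetical stand-in for whatever the client actually runs, and the temporary file names are placeholders.

```r
# Hypothetical stand-in for the client's script: it prints four lines.
script <- tempfile(fileext = ".R")
writeLines('writeLines(c("PRASANNA,Mani", "ALJHU,Bala", "SUDJOS,SU", "SUDJOS1,SU"))',
           script)

# Run the script and redirect its standard output to a log file.
log_file <- tempfile(fileext = ".txt")
rscript <- file.path(R.home("bin"), "Rscript")
status <- system2(rscript, shQuote(script), stdout = log_file)

# The log file now holds the exact text, with no OCR step involved.
captured <- readLines(log_file)
print(captured)
```

At a plain command prompt the equivalent is a redirect such as `myscript > output.txt`, where `myscript` is whatever command produced the screenshot.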

maniya90 · March 17, 2020, 4:35pm

The client logs his activity and sends the picture.
We are supposed to retrieve the text from the picture and add it to our log document.

nirgrahamuk · March 17, 2020, 8:33pm

I'm sorry to say, this approach sounds like a mistake. It's very common for people to run a process, log the result, send the outcome, and send the log. It is not standard practice to use images and OCR for this. Please consider alternatives, for your own sake.

maniya90 · March 18, 2020, 6:33am

The client is not going to change his approach.
They will provide the image from a DOS terminal; I need to retrieve the text from the image and provide it to another client.
My requirement has to be fulfilled, so I am open to anything.

mara · March 18, 2020, 2:10pm

In that case, I think you're going to need to take a deep dive into the tesseract documentation I linked to earlier. The R package is just a binding to the tesseract library, so I'd go right to the source.

You might look into customizing your configuration to deal with the specific input. I've never done this, but it's the

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data

TESSERACT(1)
============
:doctype: manpage

NAME
----
tesseract - command-line OCR engine

SYNOPSIS
--------
*tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...

DESCRIPTION
-----------
tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
at Google since then.


There's also a tesseract forum, but I think there are quite a few worthwhile approaches described in the manual to review before heading there.
https://groups.google.com/forum/?fromgroups#!forum/tesseract-ocr
