OCR using tesseract, magick

animeshsarraf · April 30, 2019, 8:34am

I am working on extracting data from invoice PDFs.

How can I make the background white so that the text is easier to read?
Also, which parameters can I tune to get more accurate text from the invoices?

Andrea · April 30, 2019, 12:12pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

animeshsarraf · April 30, 2019, 12:39pm

library(magick)
library(tiff)
library(magrittr)
library(imager)

library(png)
library(tesseract)
library(stringr)
library(qdapRegex)

imagepath <- "C:/OCR-Docs/"

# Match only names ending in .jpg (the dot is escaped so it is not a regex wildcard)
all_image <- list.files(imagepath, pattern = "\\.jpg$", all.files = TRUE, full.names = FALSE, no.. = TRUE)
all_image
eng <- tesseract("eng")

# One result slot per image
all_text <- character(length(all_image))
# seq_along() skips the loop entirely when no images are found
for (i in seq_along(all_image)) {
  fname <- paste0(imagepath, all_image[i])
  all_text[i] <- image_read(fname) %>%
    image_resize("2000x") %>%
    image_convert(type = "Grayscale") %>%
    image_trim(fuzz = 40) %>%
    tesseract::ocr(engine = eng)
  print(i)
  print(all_text[i])
}

This is the exact code I use to extract the data from the invoices.
Please help me improve the accuracy of the text output, and pre-process the invoices so that their background is white.

paleolimbot · April 30, 2019, 4:57pm

I think what you're looking for is local adaptive thresholding (via the magick::image_lat() function). Below is some code I used to process a few hundred pages that produced reasonable results. Note that I also used image_deskew(): tesseract is (I think) fairly sensitive to having straight lines be straight. The cropping is probably specific to my images, but as I remember the resolution of the images did seem to matter (this was a while ago, and results are probably better with the new version of tesseract).

library(magick)
img_files <- list.files(pattern = "[0-9]\\.png$", full.names = TRUE)
message("Cleaning image files...")
for (img_file in img_files) {
  message(img_file)
  img <- image_read(img_file) %>%
    # crop to the text area (specific to my scans)
    image_crop(geometry_area(width = 1500, height = 2100, x_off = 150, y_off = 50)) %>%
    image_convert(colorspace = "gray") %>%
    image_negate() %>%
    # local adaptive thresholding
    image_lat(geometry = "20x10+5%") %>%
    image_negate() %>%
    image_deskew() %>%
    image_trim() %>%
    image_threshold() %>%
    image_border(geometry = "50x50", color = "white")

  # base sub() avoids needing stringr; the dot is escaped to match a literal ".png"
  image_write(img, sub("\\.png$", ".clean.png", img_file))
}
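
To connect this back to the loop in the question, the cleaning steps can be wrapped in a function and applied to one image at a time. This is only a minimal sketch: clean_for_ocr() and its lat_geometry argument are hypothetical names that do not come from any package, the crop is left out because it was specific to my scans, and the built-in "logo:" image stands in for a real invoice.

```r
library(magick)

# Hypothetical helper (an assumption for illustration, not a package function):
# the cleaning steps from the loop above, minus the scan-specific crop.
clean_for_ocr <- function(img, lat_geometry = "20x10+5%") {
  img %>%
    image_convert(colorspace = "gray") %>%
    image_negate() %>%
    # local adaptive thresholding pushes the background to one flat tone
    image_lat(geometry = lat_geometry) %>%
    image_negate() %>%
    # straighten slightly rotated scans
    image_deskew() %>%
    # white margin so text does not touch the image edge
    image_border(geometry = "50x50", color = "white")
}

# "logo:" is an image built into ImageMagick, used here only as a stand-in
cleaned <- clean_for_ocr(image_read("logo:"))
print(image_info(cleaned))
```

The returned image could then be piped into tesseract::ocr() in place of the grayscale and trim steps in the original loop. Whether the 20x10+5% setting suits invoices is something to check by eye on a few samples.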

animeshsarraf · May 2, 2019, 5:03am

This code does not terminate properly.
It hangs while processing the last image in img_files, and RStudio restarts when I manually interrupt the execution.

system · May 23, 2019, 5:03am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.
