OCR using tesseract, magick

I am working on extracting data from invoice PDFs.

How can the background be made white so that the text is easier to read?
Also, which parameters can be tuned to get more accurate text from the invoices?

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide to see how to create one:

library(magick)
library(tiff)
library(magrittr)
library(imager)

library(png)
library(tesseract)
library(stringr)
library(qdapRegex)

imagepath <- "C:/OCR-Docs/"

# List the .jpg files in the folder (file names only, without "." and "..")
all_image <- list.files(imagepath, pattern = "\\.jpg$", all.files = TRUE, full.names = FALSE, no.. = TRUE)
all_image
eng <- tesseract("eng")

# One OCR result per image
all_text <- character(length(all_image))
for (i in seq_along(all_image)) {
  fname <- paste(imagepath, all_image[i], sep = "")
  all_text[i] <- image_read(fname) %>%
    image_resize("2000x") %>%
    image_convert(type = "Grayscale") %>%
    image_trim(fuzz = 40) %>%
    tesseract::ocr(engine = eng)
  print(i)
  print(all_text[i])
}

This is the exact code I use to extract data from the invoices.
Please help me improve the accuracy of the text output, and also pre-process the invoices by changing their background to white.

I think what you're looking for is local adaptive thresholding (via the magick::image_lat() function). Below is some code I used to process a few hundred pages that produced reasonable results. Note that I also used image_deskew(); tesseract is (I think) fairly sensitive to having straight lines be straight. The cropping is probably specific to my images, but as I remember, the resolution of the images did seem to matter (this was a while ago, and results are probably better with the new version of tesseract).

library(magick)
library(stringr)  # for str_replace()

img_files <- list.files(pattern = "[0-9]\\.png$", full.names = TRUE)
message("Cleaning image files...")
for(img_file in img_files) {
  message(img_file)
  img <- image_read(img_file) %>%
    # crop to the page area (specific to these scans)
    image_crop(geometry_area(width = 1500, height = 2100, x_off = 150, y_off = 50)) %>%
    image_convert(colorspace = "gray") %>%
    image_negate() %>%
    # local adaptive thresholding
    image_lat(geometry = "20x10+5%") %>%
    image_negate() %>%
    image_deskew() %>%
    image_trim() %>%
    image_threshold() %>%
    image_border(geometry = "50x50", color = "white")

  # write the result next to the original, e.g. page1.png -> page1.clean.png
  image_write(img, str_replace(img_file, "\\.png$", ".clean.png"))
}
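
On the question of which parameters to tune: a minimal sketch of running tesseract on the cleaned files with an explicitly configured engine. It assumes the magick and tesseract packages are installed and that the cleaned files end in .clean.png as in the loop above. The helper names text_path() and ocr_clean_images() are made up for illustration, and whether page segmentation mode 6 suits a particular invoice layout is something to check by trial. tesseract::tesseract_params() lists the other parameters the engine accepts.

```r
# Hypothetical helper: map "page1.clean.png" to "page1.txt".
text_path <- function(img_file) {
  sub("\\.clean\\.png$", ".txt", img_file)
}

# Hypothetical helper: OCR every cleaned image in `dir` and save the text.
ocr_clean_images <- function(dir = ".", psm = 6) {
  # tessedit_pageseg_mode = 6 asks tesseract to treat the page as a single
  # uniform block of text; this is an assumption, not a recommendation.
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )
  clean_files <- list.files(dir, pattern = "\\.clean\\.png$", full.names = TRUE)
  for (f in clean_files) {
    txt <- tesseract::ocr(f, engine = engine)
    writeLines(txt, text_path(f))
  }
  invisible(clean_files)
}

# Usage (not run here): ocr_clean_images(".", psm = 6)
```

Comparing the text output for a few values of psm on the same invoice is probably the quickest way to see whether the setting matters for these scans.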

This code does not terminate properly.
It hangs while processing the last image in "img_files", and manually terminating the execution causes RStudio to restart.
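
One way to narrow down a hang like this is to run the pipeline one step at a time on the problem image and print how long each step takes. The sketch below is only a debugging aid under that assumption; run_steps() and the steps list are hypothetical names, and nothing here identifies the actual cause.

```r
# Hypothetical helper: apply each named step in turn and report its run time.
run_steps <- function(x, steps) {
  for (name in names(steps)) {
    elapsed <- system.time(x <- steps[[name]](x))[["elapsed"]]
    message(sprintf("%-8s %.2f s", name, elapsed))
  }
  x
}

# The same magick calls as in the loop, one per step (assumes magick is installed).
steps <- list(
  gray   = function(im) magick::image_convert(im, colorspace = "gray"),
  negate = function(im) magick::image_negate(im),
  lat    = function(im) magick::image_lat(im, geometry = "20x10+5%"),
  deskew = function(im) magick::image_deskew(im),
  trim   = function(im) magick::image_trim(im)
)

# Usage (not run here), on the last file in img_files:
# img <- run_steps(magick::image_read(img_files[length(img_files)]), steps)
```

The last step name printed before the session stalls points to the call worth testing on its own.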
