Extracting Invoice Text From Image (PDF, JPEG, PNG)

MyDatada · April 21, 2022, 2:33pm

Hello all,
I am trying to read a PNG or JPEG image and extract the text from it. The images are invoices, and I want to write the extracted information to an Excel sheet. I reused the code published by this provider:
ExtractTable - API to convert image to excel, extract tables from PDF. Here is the complete code:

## Load required R packages (install them once before the first run)
install.packages(c("magrittr", "jsonlite", "httr", "xlsx"))
library(magrittr)
library(jsonlite)
library(httr)
library(xlsx)  # provides write.xlsx2(), used in the Usage section


# Main Functions

## Parse Server Response
parseResponse <- function(server_resp) {return(fromJSON(content(server_resp, "text", encoding="UTF-8")))}


## Function to Check credits usage
check_credits <- function(api_key) {
  validate_endpoint = 'https://validator.extracttable.com'
  return(content(GET(url = validate_endpoint, add_headers(`x-api-key` = api_key)), as = 'parsed', type = 'application/json'))
}

## Function to Retrieve the result by JobId
retrieve_result <- function(api_key, job_id) {
  retrieve_endpoint = "https://getresult.extracttable.com"
  return(
    GET(
      url = paste0(retrieve_endpoint, "/?JobId=", job_id),
      add_headers(`x-api-key` = api_key)
    )
  )
}


## Function to trigger a file for extraction
proces_file <- function(api_key, filepath) {
  trigger_endpoint = "https://trigger.extracttable.com"
  return (
    POST(
      url = trigger_endpoint,
      add_headers(`Content-Type`="multipart/form-data", `x-api-key` = api_key),
      body = list(input = upload_file(filepath))
    )
  )
}


## Function to extract all tables from the input file
ExtractTable <- function(filepath, api_key) {
  server_response <- proces_file(api_key, filepath)
  parsed_resp = parseResponse(server_response)
  
  
  # Wait for a maximum of 5 minutes to finish the trigger job
  # Retries every 20 seconds
  max_wait_time = 5*60
  retry_interval = 20
  while (parsed_resp$JobStatus == 'Processing' && max_wait_time >= 0) {
    max_wait_time = max_wait_time - retry_interval
    print(paste0("Job is still in progress. Let's wait for ", retry_interval, " seconds"))
    Sys.sleep(retry_interval)
    server_response <- retrieve_result(api_key, job_id=parsed_resp$JobId)
    parsed_resp = parseResponse(server_response)
  }
  
  ### Parse the response for tables
  et_tables <- content(server_response, as = 'parsed', type = 'application/json')
  
  all_tables <- list()
  
  if (tolower(parsed_resp$JobStatus) != "success") {
    print("The processing was NOT SUCCESSFUL. Below is the complete response from the server:")
    print(parsed_resp)
    return(all_tables)
  }
  
  ### Convert the extracted tabular JSON data as a dataframe for future use
  ### Each data frame represents one table
  for (i in seq_along(et_tables$Tables)) {
    all_tables[[i]] <- sapply(et_tables$Tables[[i]]$TableJson, unlist) %>% t() %>% as.data.frame()
  }
  
  return(all_tables)
  
} #end of function



# Usage

## Initialize a valid API key received from https://extracttable.com
api_key = ""

# Validate or check credits of the API key
credits <- check_credits(api_key = api_key)$usage


input_location = "E:/OCR Test/Test Bill.jpeg"
Excel_location = "E:/OCR Test/"

# Trigger the job for processing and get results as an array of dataframes
# Each data frame represents one table
results <- ExtractTable(api_key = api_key, filepath = input_location)
# Write each table to its own sheet of a single workbook
# (write.xlsx2() comes from the xlsx package)
excel_file <- paste0(Excel_location, "data_all.xlsx")  # paste() would insert a space into the path
for (i in seq_along(results)) {
  xlsx::write.xlsx2(results[[i]], excel_file, row.names = FALSE, sheetName = paste0("Sheet", i), append = TRUE)
}

I have four questions in this regard:

Q.1 We have a web page for uploading the image file to be extracted:
https://forms.pabbly.com/form/share/6BdC-483294

How can I embed Shiny in that page so that it receives the uploaded image and passes its path to the variable input_location in the R code? And how can the resulting Excel file be offered to the page's user as a download?
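
To make the question concrete, here is a minimal sketch of the kind of Shiny app I have in mind. It is only a starting point under some assumptions: extract_tables() is a hypothetical stand-in for the ExtractTable() call from the script above, the input IDs "invoice" and "excel" are names I made up, and the app would still have to be hosted somewhere and linked or embedded from the form page.

```r
library(shiny)

# Hypothetical stand-in: replace the body with a call such as
# ExtractTable(filepath = path, api_key = api_key) from the script above.
extract_tables <- function(path) {
  list(data.frame(source_file = basename(path)))
}

ui <- fluidPage(
  fileInput("invoice", "Upload invoice (PDF, JPEG, PNG)",
            accept = c(".pdf", ".jpg", ".jpeg", ".png")),
  downloadButton("excel", "Download Excel file")
)

server <- function(input, output, session) {
  # Re-runs whenever a new file is uploaded. datapath is the temporary
  # server-side copy of the upload, so it plays the role of input_location.
  tables <- reactive({
    req(input$invoice)
    extract_tables(input$invoice$datapath)
  })

  output$excel <- downloadHandler(
    filename = function() "data_all.xlsx",
    content = function(file) {
      res <- tables()
      # One sheet per extracted table; the first write creates the file.
      for (i in seq_along(res)) {
        xlsx::write.xlsx2(res[[i]], file, sheetName = paste0("Sheet", i),
                          row.names = FALSE, append = i > 1)
      }
    }
  )
}

app <- shinyApp(ui, server)
# Start it locally with: runApp(app)
```

Here fileInput() takes the place of the hard-coded input_location and downloadHandler() takes the place of Excel_location, so nothing is written to a fixed folder on the server.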

Q.2 Our customers need the Excel output in a specific template format. How can I arrange the output in the same format as the one in the screenshot below?

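Q.3 For the Arabic language, I have the following code:

install.packages("tesseract")
library(tidyverse)
library(tesseract)
tesseract_info()
# If "ara" is not listed among the available languages, download it once:
# tesseract_download("ara")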

knitr::include_graphics("E:/OCR Test/Invoice.PNG")

textt3 <- tesseract::ocr(image = "E:/OCR Test/Invoice2.jpeg",
                         engine = tesseract("ara"))
cat(textt3)

How can I combine this Arabic text extraction with the original code posted above?
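
As a rough sketch of what I am aiming for, the OCR text could be turned into a data frame so that it can go through the same Excel-writing loop as the tables above. The helper names ocr_text_to_df() and ocr_arabic_lines() are hypothetical, and this only gives one row per recognised line, not a real table structure.

```r
# Split raw OCR output into one trimmed, non-empty line per row.
ocr_text_to_df <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  lines <- lines[nzchar(lines)]
  data.frame(line = seq_along(lines), text = lines, stringsAsFactors = FALSE)
}

# Run Tesseract with the Arabic model on one image and return its lines.
# Requires the tesseract package and the "ara" language data.
ocr_arabic_lines <- function(image_path) {
  raw_text <- tesseract::ocr(image_path, engine = tesseract::tesseract("ara"))
  ocr_text_to_df(raw_text)
}

# Example on a plain string, without running OCR:
demo <- ocr_text_to_df("Invoice No 15\n\n  Total 100  \n")
print(demo)
```

The result of ocr_arabic_lines("E:/OCR Test/Invoice2.jpeg") could then be written with write.xlsx2() in the same way as each element of results.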

Q.4 Are there any suggestions for getting the same results on my own, without using the provider's service (API token)?
