Tesseract OCR (Optical character recognition)

Optical character recognition (OCR)

Optical character recognition is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or a scene photo (for example, the text on signs and billboards in a landscape photo).

Tesseract OCR

  • Tesseract is an open source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google.
  • It runs on Windows, Linux and macOS.
  • It is written in C and C++ and performs well.
  • It can be trained with additional data to improve recognition, for example of handwritten text.
  • It can be deployed on Amazon EBS; refer to the link for installing Tesseract on EBS.

URL : https://github.com/tesseract-ocr/tesseract

Installation

 

Running the Application

Command

tesseract image.tif output

  1. tesseract – the command that invokes Tesseract.
  2. image.tif – the path to the image on which OCR is run. Here image.tif is assumed to be in the current directory.
  3. output – the base name of the output file that receives the text recognized in the image. Tesseract appends .txt, so by default output.txt is written to the current directory.
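The command above can also be driven from a script. The following is a minimal Python sketch, assuming the tesseract executable is installed and on the PATH; the function names build_tesseract_command and run_ocr are hypothetical helpers introduced here for illustration and are not part of Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base="output"):
    """Return the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base="output"):
    """Run Tesseract on one image and return the recognized text.

    Assumes the tesseract executable is on the PATH (hypothetical helper).
    """
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling run_ocr("image.tif") would run the same command as shown above and read back output.txt from the current directory.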

The output may contain many spelling mistakes. To reduce them, increase the image resolution before running OCR; the ImageMagick command line tool can be used for this.

ImageMagick command to increase the resolution

magick image.jpg -resize 10000 output_image.tif
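The resize step can be scripted in the same way. The following Python sketch only wraps the ImageMagick command shown above, assuming the magick executable is on the PATH; build_resize_command and upscale_image are hypothetical names chosen for illustration, and the default width of 10000 is simply the value from the command above, not a recommended setting.

```python
import shutil
import subprocess


def build_resize_command(src, dst, width=10000):
    """Return the argument list for: magick <src> -resize <width> <dst>."""
    return ["magick", str(src), "-resize", str(width), str(dst)]


def upscale_image(src, dst, width=10000):
    """Resize src to the given width with ImageMagick and write it to dst.

    Assumes the magick executable is on the PATH (hypothetical helper).
    """
    if shutil.which("magick") is None:
        raise FileNotFoundError("magick executable not found on PATH")
    # check=True raises CalledProcessError if ImageMagick reports an error.
    subprocess.run(build_resize_command(src, dst, width), check=True)
    return dst
```

The resized file, for example output_image.tif, can then be passed to the tesseract command in place of the original image.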

List and comparison of optical character recognition software

Refer to the Wikipedia page below:

https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software