OCR


OCR - Optical Character Recognition

OCR is a technology that allows you to convert scanned images of text into plain text. This enables you to save space, edit the text and search/index it.

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

  • fuzzyocr - spamassassin plugin to check image attachments

  • gocr - a command line OCR

  • libhocr0 - Hebrew OCR

  • ocrad - OCR program

  • ocrfeeder - document layout analysis and optical character recognition system

  • ocropus - document analysis and OCR system

  • tesseract-ocr - command line OCR
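
To see which of these tools are already installed, a short shell sketch like the one below may help. It assumes that the command names match the package names for gocr, ocrad and ocrfeeder, and that the tesseract-ocr package provides a tesseract command (as used later on this page). The function name ocr_tools_installed is only illustrative.

```shell
#!/bin/sh
# Print each OCR command that is found on the PATH.
# The default command names are assumptions based on the package names above.
ocr_tools_installed() {
    [ $# -gt 0 ] || set -- gocr ocrad ocrfeeder tesseract
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}
```

Calling ocr_tools_installed without arguments checks the default list; passing names (for example ocr_tools_installed tesseract cuneiform) checks only those.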

The Ubuntu multiverse repositories also contain:

OCRFeeder

While Tesseract and CuneiForm are the most accurate engines, on Linux they currently lack a graphical interface (GUI), which is an important usability feature for a typical desktop user.

The OCRFeeder suite provides a handy GUI, which is essentially a front end for several image, OCR and text tools (such as unpaper or a spellchecker). It does not perform character recognition itself, but delegates it to other OCR applications through its "OCR engines" settings. It has predefined settings for Tesseract, CuneiForm, GOCR and Ocrad, so the user does not need to know how to invoke them. You only need to install one or more OCR engines of your choice in Ubuntu and then detect them in the OCRFeeder settings. Other engines can be added and their options changed manually, and more than one engine entry can use the same application. The main OCRFeeder window lets you choose on the fly which engine to use for a particular area, and there is also a setting for making one engine the default.

As of version 0.7.3 there is no easy way to choose the language of the text to be recognized. For Tesseract and CuneiForm, you have to add the "-l" switch followed by the proper language/script code (for example "-l pol" for Polish or "-l dan-frak" for Danish Fraktur) to the given engine's settings. You can even create separate entries for each desired combination of language and application (naming them, for example, "Traditional Chinese - Tesseract", "German - Tesseract" and "German - CuneiForm", since you may want the same language to be recognized by different applications) and select them later from the "OCR engines" pull-down list in the main OCRFeeder window.

OCRFeeder can also be run in pure command line mode:

$ ocrfeeder-cli -i input1.jpg input2.jpg -f html -o output.htm

Tesseract

Introduction

Tesseract arguably produces the most accurate results. It was initially developed at HP Labs between 1985 and 1995 and was open-sourced in 2005.

Version 2.x did not support layout analysis, so multi-column text, images, equations and similar content are likely to produce garbled output. It also supported only TIFF images as input. Version 3.x includes layout analysis and, if compiled with Leptonica, supports all image formats that Leptonica supports.

Originally Tesseract could recognize text in English only; version 2.x extended this to 7 languages: English, German, French, Italian, Spanish, Brazilian Portuguese and Dutch. You can install more than one dictionary if needed. Newer versions can recognize text in the following languages/scripts (codes loosely based on ISO 639-2):

  • ara - Arabic
  • eng - English
  • bul - Bulgarian
  • cat - Catalan
  • ces - Czech
  • chi_sim - Chinese [Simplified]
  • chi_tra - Chinese [Traditional]
  • dan - Danish
  • dan-frak - Danish [Fraktur]
  • deu - German
  • ell - Greek [Modern]
  • fin - Finnish
  • fra - French
  • heb - Hebrew
  • hrv - Croatian
  • hun - Hungarian
  • ind - Indonesian
  • ita - Italian
  • jpn - Japanese
  • kor - Korean
  • lav - Latvian
  • lit - Lithuanian
  • nld - Dutch
  • nor - Norwegian
  • osd - [Orientation and Script Detection]
  • pol - Polish
  • por - Portuguese
  • ron - Romanian
  • rus - Russian
  • slk - Slovak
  • slk-frak - Slovak [Fraktur]
  • slv - Slovenian
  • spa - Spanish
  • srp - Serbian
  • swe - Swedish
  • tgl - Tagalog
  • tha - Thai
  • tur - Turkish
  • ukr - Ukrainian
  • vie - Vietnamese

Usage

The current version of Tesseract in the Ubuntu repository is a command-line-only tool. After successful installation, the command to use is tesseract <path to image> <basename of output file>. Tesseract will automatically give the output file a .txt extension. If you have installed the language-specific data files from one of the tesseract-ocr-??? packages, you can add the -l option followed by the language code.
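
As a rough sketch of how a language package might be located and then used: the helper below builds a package name from a language code, assuming the tesseract-ocr-??? pattern mentioned above and assuming that underscores in a code become hyphens in the package name (so chi_sim would map to tesseract-ocr-chi-sim). The helper name tess_lang_pkg and the file name scan.tif are hypothetical.

```shell
#!/bin/sh
# Map a Tesseract language code to the assumed Ubuntu package name.
# Assumption: underscores in the code are replaced by hyphens.
tess_lang_pkg() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

# Example (not run automatically):
#   sudo apt-get install "$(tess_lang_pkg pol)"
#   tesseract scan.tif scan -l pol    # writes scan.txt
```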

For versions of Tesseract older than 3, it is critical that the image is in Tagged Image File Format (TIFF) and has a ".tif" extension, not a ".tiff" extension. The command line should look like this example:

$ tesseract ~/input.tif output

Here input.tif is the document to be converted, located in your home folder, and output is the base name of the file Tesseract will create. Tesseract adds the .txt extension automatically, so the result is output.txt.
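
When several scans need converting, the single command above can be wrapped in a loop. The following is a minimal sketch, not a definitive tool: the function name ocr_dir is made up for illustration, and the default language eng assumes the English data files are installed.

```shell
#!/bin/sh
# OCR every .tif file in a directory; each result lands next to its image.
# Usage: ocr_dir DIRECTORY [LANGUAGE_CODE]
ocr_dir() {
    dir=$1
    lang=${2:-eng}                     # assumed default: English data installed
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # no .tif files: nothing to do
        base=${img%.tif}               # tesseract appends .txt itself
        tesseract "$img" "$base" -l "$lang" || return 1
    done
}
```

For example, ocr_dir ~/scans pol would write ~/scans/page1.txt for ~/scans/page1.tif, and so on for every .tif file in that folder.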

Preparing images for old versions of Tesseract

Tesseract 2.x is not very flexible about the format of its input images. It will only accept TIFF images. According to user reports, compressed TIFF images are quite problematic, and the same goes for grey-scale and colour images. So you are better off with single-bit uncompressed TIFF images.

The process to prepare them with GIMP is very simple:

  1. Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.

  2. Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.

  3. Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.

  4. Save the image in TIFF format with a .tif extension.
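
The same preparation can probably be scripted with ImageMagick (which is used elsewhere on this page) instead of GIMP. The sketch below is an approximation of the steps above, not an exact equivalent: the function name to_bilevel_tif is hypothetical, and the 50% default threshold is only an assumed starting point that needs tuning for each scan.

```shell
#!/bin/sh
# Convert an image to a 1-bit, uncompressed TIFF using ImageMagick.
# Usage: to_bilevel_tif INPUT OUTPUT.tif [THRESHOLD]
to_bilevel_tif() {
    src=$1
    dst=$2
    threshold=${3:-50%}    # assumed starting point; tune per scan
    convert "$src" -colorspace Gray -threshold "$threshold" \
        -type bilevel -compress none "$dst"
}
```

For example, to_bilevel_tif scan.png scan.tif 60% would threshold at 60% and write scan.tif with the .tif extension that old Tesseract versions expect.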

Using Tesseract With a Multi Page PDF

Often, scanned documents are stored as a raster image in a large PDF document. Using ImageMagick, the individual pages can then be extracted as TIFF files for processing using Tesseract. The following script can help automate this process:

#!/bin/sh
STARTPAGE=6       # number of the first PDF page you wish to convert
ENDPAGE=255       # number of the last PDF page you wish to convert
SOURCE=book.pdf   # file name of the PDF
OUTPUT=book.txt   # final output file
RESOLUTION=600    # resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    echo "processing page $i"
    # ImageMagick counts PDF pages from 0, hence the "- 1".
    # -density must precede the input file; -monochrome is applied after reading it.
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome page.tif
    tesseract page.tif tempoutput
    cat tempoutput.txt >> "$OUTPUT"
done

After running this script, the OCR text should be contained in book.txt (or whatever you set $OUTPUT to be).
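
If the page count is not known in advance, it can be read from the PDF rather than typed in by hand. The sketch below assumes the pdfinfo tool from poppler-utils is installed, which this page does not otherwise require; the function name pdf_page_count is illustrative.

```shell
#!/bin/sh
# Print the number of pages in a PDF, as reported by pdfinfo.
# Usage: pdf_page_count FILE.pdf
pdf_page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Example (not run automatically):
#   ENDPAGE=$(pdf_page_count book.pdf)
```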

CuneiForm

Introduction

CuneiForm is another OCR system, which was originally developed and open-sourced by Cognitive Technologies.

The Windows version, which has its own graphical interface, can be run with some success under Wine. The Linux port is being developed on Launchpad, and while it currently does not have its own GUI, CuneiForm can be run successfully from within the OCRFeeder graphical interface.

CuneiForm recognizes Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, German, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Russian-English bilingual, Serbian, Slovene, Spanish, Swedish, Turkish, and Ukrainian text.

List of language/script codes (loosely based on ISO 639-2):

  • bul - Bulgarian
  • cro - Croatian
  • cze - Czech
  • dan - Danish
  • dut - Dutch
  • est - Estonian
  • frn - French
  • ger - German
  • hun - Hungarian
  • ita - Italian
  • lat - Latvian
  • lit - Lithuanian
  • pol - Polish
  • por - Portuguese
  • rom - Romanian
  • rus - Russian
  • ser - Serbian
  • slo - Slovene
  • spa - Spanish
  • swe - Swedish
  • tur - Turkish
  • ukr - Ukrainian
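
A language code from this list is passed to CuneiForm with the -l switch described in the OCRFeeder section, and the output file is set with -o. A minimal sketch, with a hypothetical wrapper name cuneiform_ocr and placeholder file names:

```shell
#!/bin/sh
# Run CuneiForm on one image with an explicit language code.
# Usage: cuneiform_ocr LANGUAGE_CODE IMAGE OUTPUT.txt
cuneiform_ocr() {
    cuneiform -l "$1" -o "$3" "$2"
}

# Example (not run automatically):
#   cuneiform_ocr pol scan.tif scan.txt
```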

From JPEG to TXT

The following is an anecdotal example. This script successfully converted some JPEG screenshots of an internet message board into useful plain-text files:

#!/bin/bash
if [ "$1" ] && [ -e "$1" ]; then
  TMPF=$(mktemp XXXXXXXX.tif)
  DEST="$2"
  if [ ! "$DEST" ]; then
    DEST="${1%.*}.txt"
    if [ -e "$DEST" ]; then
      echo "$DEST already exists; please provide a new textfile name" >&2
      exit 1
    fi
  fi
  /usr/bin/convert "$1" -colorspace Gray -depth 8 -resample 200x200 $TMPF \
  && /usr/bin/cuneiform -o "$DEST" $TMPF
  EX=$?
  /bin/rm -f $TMPF
  [ $EX -eq 0 ] && [ "$TERM" ] && echo "created $DEST"
  exit $EX
else
  echo "Usage: $0 imagefile [textfile]" >&2
  echo " creates a plain text file with the text found in imagefile" >&2
  exit 1
fi

OCR on a Multi Page PDF

If you have a multi-page PDF file and want to make it searchable, you can use one of the following methods.

gscan2pdf

This is probably the easiest way of doing this. Gscan2pdf is a graphical tool which lets you not only scan files, but also import files and perform OCR on them.

  • Install gscan2pdf, either from the Ubuntu Software Center or by running this command in a terminal:

    $ sudo apt-get install gscan2pdf

  • Run gscan2pdf
  • Import the PDF (Ctrl+i)
  • Choose Tools→OCR

  • Save (Ctrl+s)

It may take some time if you have many pages. This is normal.

OCRFeeder

OCRFeeder can do this too. Sadly it doesn't seem to work very well yet.

pdfocr

pdfocr is a script that performs OCR on multi-page PDF files and embeds the text back into the PDF file as a searchable text layer. It can use either Tesseract or CuneiForm as the OCR engine. The script itself can be obtained from GitHub or from the PPA. To use it, enter this command in a terminal:

pdfocr -i input.pdf -o output.pdf


OCR (last edited 2015-03-31 12:07:20 by p4FC25736)