Tesseract is an OCR (Optical Character Recognition) engine whose development has been funded by Google since 2006.

As of version 11.10, Ubuntu still ships Tesseract 2.04, which supports only 7 recognition languages. Tesseract 3.0 (released in September 2010) supports 29. This guide explains how to build and install Tesseract 3.01 from source on Ubuntu 11.10.

Installation instructions

  • Download and extract Tesseract 3.01:

wget http://tesseract-ocr.googlecode.com/files/tesseract-3.01.tar.gz

tar zxvf tesseract-3.01.tar.gz

  • Install the Leptonica image processing library:

sudo apt-get install libleptonica-dev

  • Compile, from inside the extracted source directory (assumed here to be tesseract-3.01):

cd tesseract-3.01

./autogen.sh

./configure

make

Note: running make check at this point fails in the java/ directory with the error: No rule to make target `check'. Stop.

  • Install:

sudo make install

sudo ldconfig

  • Install recognition languages. For example, for Greek (language code ell):

wget http://tesseract-ocr.googlecode.com/files/ell.traineddata.gz

gzip -d ell.traineddata.gz

sudo mv ell.traineddata /usr/local/share/tessdata
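
Once a language file has been moved into place, it can be worth confirming that it is where Tesseract will look for it before starting a recognition job. The following is only a minimal sketch and not part of the original instructions: check_lang is a hypothetical helper name, and the default directory assumes the /usr/local prefix used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): succeeds only if the
# trained-data file for a language exists and is non-empty.
# Usage: check_lang LANG [TESSDATA_DIR]
check_lang() {
    lang="$1"
    # Assumption: default install prefix /usr/local, as in the steps above.
    dir="${2:-/usr/local/share/tessdata}"
    [ -s "$dir/$lang.traineddata" ]
}

# Example, commented out so that sourcing this file has no side effects.
# image.tif is a placeholder for your own scanned image.
#   check_lang ell && tesseract image.tif out -l ell
#   # the recognized text is written to out.txt
```

If the check fails, the most likely causes are a different install prefix (for example one passed to ./configure) or a language file that was not decompressed.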

Tesseract3 (last edited 2011-12-31 09:26:23 by 78)