Revision 5 as of 2011-12-31 09:26:23

Tesseract is an OCR (Optical Character Recognition) engine whose development has been funded by Google since 2006.

As of version 11.10, Ubuntu still ships Tesseract 2.04, which supports only 7 recognition languages. Tesseract 3.0 (released in September 2010) supports a total of 29 recognition languages. This guide will help you get Tesseract 3.01 working on Ubuntu 11.10.

Installation instructions

  • Download and extract Tesseract 3.01:

wget http://tesseract-ocr.googlecode.com/files/tesseract-3.01.tar.gz

tar zxvf tesseract-3.01.tar.gz

cd tesseract-3.01

  • Install the Leptonica image processing library:

sudo apt-get install libleptonica-dev

  • Compile:

./autogen.sh

./configure

make

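Note: running make check at this point fails in the java/ directory with the error: No rule to make target `check'. Stop. You can skip make check and continue with the installation.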

  • Install:

sudo make install

sudo ldconfig

  • Install recognition languages (the commands below install Greek, language code ell):

wget http://tesseract-ocr.googlecode.com/files/ell.traineddata.gz

gzip -d ell.traineddata.gz

sudo mv ell.traineddata /usr/local/share/tessdata
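The three language commands above can be wrapped in a small helper if you need more than one language. The following is only a sketch: the function names lang_url and install_lang and the DRY_RUN switch are made up for illustration, and it assumes that other language packs use the same file naming as ell.traineddata.gz, which is not guaranteed for every language. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: install a Tesseract 3.01 language pack by its language code.
# BASE_URL and TESSDATA_DIR are taken from the steps above.
# Assumption: other packs are named <code>.traineddata.gz like the Greek one.
BASE_URL="http://tesseract-ocr.googlecode.com/files"
TESSDATA_DIR="/usr/local/share/tessdata"

# Hypothetical switch: 1 (default) only prints the commands, 0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# Print the download URL for a language code such as "ell".
lang_url() {
    printf '%s/%s.traineddata.gz\n' "$BASE_URL" "$1"
}

# Download, unpack and install one language pack (or print the commands).
install_lang() {
    url=$(lang_url "$1")
    if [ "$DRY_RUN" = "1" ]; then
        echo "wget $url"
        echo "gzip -d $1.traineddata.gz"
        echo "sudo mv $1.traineddata $TESSDATA_DIR"
    else
        wget "$url" &&
        gzip -d "$1.traineddata.gz" &&
        sudo mv "$1.traineddata" "$TESSDATA_DIR"
    fi
}

# Example: preview the commands for Greek.
install_lang ell
```

Run the script with DRY_RUN=0 in the environment to execute the commands instead of printing them. Once a language is installed, it is selected with the -l option, for example: tesseract image.tif output -l ell (image.tif and output are placeholder names; the recognized text is written to output.txt).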

See also