Tesseract is an OCR (Optical Character Recognition) engine whose development has been sponsored by Google since 2006.
As of version 11.10, Ubuntu still ships Tesseract 2.04, which supports only 7 recognition languages. Tesseract 3.0 (released in September 2010) supports 29. This guide will help you get Tesseract 3.01 working on Ubuntu 11.10.
Installation instructions
- Download and extract Tesseract 3.01:
wget http://tesseract-ocr.googlecode.com/files/tesseract-3.01.tar.gz
tar zxvf tesseract-3.01.tar.gz
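- Install the Leptonica image processing library (the build also needs a compiler and the autotools, so install build-essential, autoconf, automake and libtool as well if they are missing):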
sudo apt-get install libleptonica-dev
- Compile, from inside the extracted source directory:
cd tesseract-3.01
./autogen.sh
./configure
make
Note: running make check afterwards fails in the java/ subdirectory with: No rule to make target `check'. Stop.
- Install:
sudo make install
sudo ldconfig
- Install recognition languages. Each language is a separate download; Greek (ell) is used here as an example:
wget http://tesseract-ocr.googlecode.com/files/ell.traineddata.gz
gzip -d ell.traineddata.gz
sudo mv ell.traineddata /usr/local/share/tessdata/
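Once a language file is in place, it can be handy to check for it before running OCR. The following is only a minimal sketch, not part of Tesseract itself: the names TESSDATA_DIR, has_lang and ocr_image are made up for illustration, and the default path assumes the /usr/local install location used above.

```shell
#!/bin/sh
# Illustrative helpers; TESSDATA_DIR, has_lang and ocr_image are
# hypothetical names, not something Tesseract provides.

# Assumed install location of the language files (matches the mv step above).
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# has_lang CODE
# Succeeds if CODE.traineddata exists in TESSDATA_DIR.
has_lang() {
    [ -f "$TESSDATA_DIR/$1.traineddata" ]
}

# ocr_image CODE IMAGE OUTPUTBASE
# Runs tesseract on IMAGE with language CODE, but only if the
# language file is present; otherwise prints a message and fails.
ocr_image() {
    if ! has_lang "$1"; then
        echo "missing $TESSDATA_DIR/$1.traineddata" >&2
        return 1
    fi
    tesseract "$2" "$3" -l "$1"
}
```

For example, after sourcing the file, ocr_image ell scan.tif scan should write the recognized text to scan.txt (scan.tif is a placeholder for your own image). The check is done up front so that a missing language file produces a clear message instead of a failure inside tesseract.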