= OCR - Optical Character Recognition =

OCR is a technology that converts scanned images of text into plain text. This lets you save space, edit the text, and search or index it.

= Available OCR tools =

The Ubuntu Universe repositories contain the following OCR tools:

 * [[http://fuzzyocr.own-hero.net/|fuzzyocr]] - spamassassin plugin to check image attachments
 * [[http://jocr.sourceforge.net/|gocr]] - a command line OCR
 * [[http://hocr.berlios.de/|libhocr0]] - Hebrew OCR
 * [[http://www.gnu.org/software/ocrad/ocrad.html|ocrad]] - OCR program
 * [[http://live.gnome.org/OCRFeeder|ocrfeeder]] - document layout analysis and optical character recognition system
 * [[http://code.google.com/p/ocropus/|ocropus]] - document analysis and OCR system
 * [[http://code.google.com/p/tesseract-ocr/|tesseract-ocr]] - command line OCR

The Ubuntu Multiverse repositories also contain:

 * [[http://launchpad.net/cuneiform-linux/|cuneiform]] - multi-language OCR system

== OCRFeeder ==

While Tesseract and CuneiForm are the most accurate, on Linux they currently lack a graphical interface (GUI), which is an important usability feature for a typical desktop user. The OCRFeeder suite provides a handy GUI, which is basically a front-end for various image, OCR and text tools (such as unpaper or a spellchecker). It does not perform character recognition itself, but uses other OCR applications (through its so-called "OCR engines" settings) instead.

It has predefined settings for Tesseract, CuneiForm, GOCR and Ocrad, so the user doesn't need to know how to invoke them. You only have to install your OCR engines of choice (one or more) in Ubuntu and then detect them in the OCRFeeder settings. It is possible to add other engines and to change these options manually, and more than one engine entry can use the same application. The main OCRFeeder window lets you choose on the fly which engine to use for a particular area; there is also a setting for making one engine the default choice.
As of version 0.7.3 there is no easy way to choose the language of the text to be recognized. For Tesseract and CuneiForm you have to add the "-l" switch followed by the proper language/script code (for example "-l pol" for Polish or "-l dan-frak" for Danish Fraktur) to the given engine's settings. You can even create separate entries for each desired combination of language and application, naming them, for example, "Traditional Chinese - Tesseract", "German - Tesseract" and "German - CuneiForm" (since you may want the same language to be recognized by different applications), and select them later from the "OCR engines" pull-down list in the main OCRFeeder window.

OCRFeeder can also be run in pure command line mode:

{{{$ ocrfeeder-cli -i input1.jpg input2.jpg -f html -o output.htm}}}

== Tesseract ==

=== Introduction ===

Tesseract arguably produces the best (most accurate) results. It was initially developed by HP Labs between 1985 and 1995 and was open-sourced in 2005.

Version 2.x did not support layout analysis, so multi-column text, images, equations etc. are likely to produce garbled text output. It also supported only TIFF images as input. Version 3.x includes layout analysis and, if compiled with Leptonica, supports all image formats that Leptonica supports.

Originally Tesseract could recognize English text only; version 2.x extended this to seven languages: English, German, French, Italian, Spanish, Brazilian Portuguese and Dutch. You can install more than one dictionary if needed.
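As a rough sketch of installing several dictionaries at once, the helper below turns Tesseract language codes into package names. It assumes the language data packages follow the {{{tesseract-ocr-???}}} naming pattern, with an underscore in a code written as a hyphen in the package name; the function name {{{tess_lang_packages}}} is hypothetical and only used for illustration. The script prints the install command rather than running it, so you can check the package names first.

{{{#!bash
#!/bin/sh
# Sketch: map Tesseract language codes to Ubuntu package names.
# Assumption: packages are named tesseract-ocr-<code>, with "_" in a
# code replaced by "-" (e.g. chi_sim -> tesseract-ocr-chi-sim).
tess_lang_packages() {
    for code in "$@"; do
        printf 'tesseract-ocr-%s\n' "$code" | tr '_' '-'
    done
}

# Print (do not run) the install command for French and Simplified Chinese.
echo "sudo apt-get install" $(tess_lang_packages fra chi_sim)
}}}

If a printed package name is not found in the repositories, search for the exact name with {{{apt-cache search tesseract-ocr}}}.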
Newer versions can recognize text in the [[https://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/Makefile.in|following languages/scripts]] (codes loosely based on [[http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes|ISO 639-2]]):

 * ara - Arabic
 * eng - English
 * bul - Bulgarian
 * cat - Catalan
 * ces - Czech
 * chi_sim - Chinese [Simplified]
 * chi_tra - Chinese [Traditional]
 * dan - Danish
 * dan-frak - Danish [Fraktur]
 * deu - German
 * ell - Greek [Modern]
 * fin - Finnish
 * fra - French
 * heb - Hebrew
 * hrv - Croatian
 * hun - Hungarian
 * ind - Indonesian
 * ita - Italian
 * jpn - Japanese
 * kor - Korean
 * lav - Latvian
 * lit - Lithuanian
 * nld - Dutch
 * nor - Norwegian
 * osd - [Orientation and Script Detection]
 * pol - Polish
 * por - Portuguese
 * ron - Romanian
 * rus - Russian
 * slk - Slovak
 * slk-frak - Slovak [Fraktur]
 * slv - Slovenian
 * spa - Spanish
 * srp - Serbian
 * swe - Swedish
 * tgl - Tagalog
 * tha - Thai
 * tur - Turkish
 * ukr - Ukrainian
 * vie - Vietnamese

=== Usage ===

The current version of Tesseract in the Ubuntu repository is a command-line-only tool. After successful installation, the command to use is {{{tesseract}}}, followed by the input image and the name of the output file. Tesseract automatically gives the output file a {{{.txt}}} extension. If you have installed the language-specific data files from one of the {{{tesseract-ocr-???}}} packages, you can add the {{{-l}}} option followed by the language code.

For versions of Tesseract older than 3 it is critical that the image is in Tagged Image File Format and has a ".tif" extension, not a ".tiff" extension.

The command line should look like this example:

{{{$ tesseract ~/input.tif output}}}

Here {{{input.tif}}} is the document to be converted, located in your home folder, and {{{output}}} is the base name of the document that Tesseract will create as {{{output.txt}}}.

=== Preparing images for old versions of Tesseract ===

Tesseract 2.x is not very flexible about the format of its input images.
It will only accept TIFF images. According to user reports, compressed TIFF images are quite problematic, as are grey-scale and colour images, so you are better off with single-bit uncompressed TIFF images. The process to prepare them with GIMP is simple:

 1. Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
 2. Select Tools→Color Tools→Threshold from the menu and choose an adequate threshold value.
 3. Select Image→Mode→Indexed from the menu and choose 1-bit and no dithering from the options.
 4. Save the image in TIFF format with a .tif extension.

=== Using Tesseract With a Multi Page PDF ===

Often, scanned documents are stored as raster images in a large PDF document. Using [[ImageMagick]], the individual pages can be extracted as TIFF files for processing with Tesseract. The following script can help automate this process:

{{{#!bash
#!/bin/sh
STARTPAGE=6      # page number of the first page of the PDF you wish to convert
ENDPAGE=255      # page number of the last page of the PDF you wish to convert
SOURCE=book.pdf  # file name of the PDF
OUTPUT=book.txt  # final output file
RESOLUTION=600   # resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    # ImageMagick numbers PDF pages from 0, hence the "- 1"
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" page.tif
    echo "processing page $i"
    tesseract page.tif tempoutput
    cat tempoutput.txt >> "$OUTPUT"
done
}}}

After running this script, the OCR text should be contained in {{{book.txt}}} (or whatever you set {{{$OUTPUT}}} to be).

== CuneiForm ==

=== Introduction ===

CuneiForm is another OCR system, originally developed and open-sourced by Cognitive Technologies. The Windows version, which has its own graphical interface, can be run [[http://appdb.winehq.org/objectManager.php?sClass=version&iId=10327|with some results]] under [[Wine]].
Its Linux port is being developed on [[http://launchpad.net/cuneiform-linux|Launchpad]] and, while it currently doesn't have its own GUI, CuneiForm can be successfully run from within the OCRFeeder graphical interface.

CuneiForm recognizes Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, German, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Russian-English bilingual, Serbian, Slovene, Spanish, Swedish, Turkish, and Ukrainian text.

List of [[http://bazaar.launchpad.net/~jpakkane/cuneiform-linux/trunk/files/head:/datafiles/|language/script]] codes (loosely based on [[http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes|ISO 639-2]]):

 * bul - Bulgarian
 * cro - Croatian
 * cze - Czech
 * dan - Danish
 * dut - Dutch
 * est - Estonian
 * frn - French
 * ger - German
 * hun - Hungarian
 * ita - Italian
 * lat - Latvian
 * lit - Lithuanian
 * pol - Polish
 * por - Portuguese
 * rom - Romanian
 * rus - Russian
 * ser - Serbian
 * slo - Slovene
 * spa - Spanish
 * swe - Swedish
 * tur - Turkish
 * ukr - Ukrainian

=== From JPEG to TXT ===

The following is an anecdotal example. I had success converting some JPEG screenshots of an internet message board into useful plain text files with:

{{{#!bash
#!/bin/bash
if [ "$1" ] && [ -e "$1" ]; then
    DEST="$2"
    if [ ! "$DEST" ]; then
        # No output name given: use the image name with a .txt extension
        DEST="${1%.*}.txt"
        if [ -e "$DEST" ]; then
            echo "$DEST already exists; please provide a new textfile name" >&2
            exit 1
        fi
    fi
    # Create the temporary file only after the checks, so none is left behind
    TMPF=$(mktemp XXXXXXXX.tif)
    # Convert to 8-bit grayscale resampled to 200 dpi, then run CuneiForm
    /usr/bin/convert "$1" -colorspace Gray -depth 8 -resample 200x200 "$TMPF" \
        && /usr/bin/cuneiform -o "$DEST" "$TMPF"
    EX=$?
    /bin/rm -f "$TMPF"
    [ $EX -eq 0 ] && [ "$TERM" ] && echo "created $DEST"
    exit $EX
else
    echo "Usage: $0 imagefile [textfile]" >&2
    echo "  creates a plain text file with the text found in imagefile" >&2
    exit 1
fi
}}}

= OCR on a Multi Page PDF =

If you have a multi-page PDF file and want to make it searchable, use one of the following methods.
== gscan2pdf ==

This is probably the easiest method. gscan2pdf is a graphical tool that lets you not only scan documents, but also import existing files and perform OCR on them.

 * Install gscan2pdf, either from the Ubuntu Software Center or by running this command in a terminal: {{{$ sudo apt-get install gscan2pdf}}}
 * Run gscan2pdf
 * Import the PDF (Ctrl+i)
 * Choose Tools=>OCR
 * Save (Ctrl+s)

It may take some time if you have many pages. This is normal.

== OCRFeeder ==

OCRFeeder can do this too, but it doesn't seem to work very well yet.

== pdfocr ==

pdfocr is a script that performs OCR on multi-page PDF files and embeds the text back into the PDF file as a searchable text layer. It can use either tesseract or cuneiform as the OCR engine. The script itself can be obtained from [[http://github.com/gkovacs/pdfocr/raw/master/pdfocr.rb|GitHub]] or from the [[http://launchpad.net/~gezakovacs/+archive/pdfocr|PPA]]. To use it, enter this command in a terminal:

{{{pdfocr -i input.pdf -o output.pdf}}}

= Further Reading =

 * [[http://www.linuxjournal.com/article/9676|A LinuxJournal article on Tesseract]]

----
CategoryGraphicsApplications