Tesseract 3.02

Introduction
Tesseract is probably the most widely used open source OCR application. The information here is mainly based on http://code.google.com/p/tesseract-ocr/wiki/ReadMe, the Tesseract manual page http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/tesseract.1.html and the FAQ http://code.google.com/p/tesseract-ocr/wiki/FAQ. The description applies to Tesseract 3.02.

Install software
There are two packages to install, the engine itself, and the training data for a language.

Linux
Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr' - search your distribution's repositories to find it. E.g. on a recent Ubuntu or Debian system, simply running

sudo apt-get install tesseract-ocr

will install the program.

Packages are also generally available for language training data (search the repositories), but if not you will need to download the appropriate training data from http://code.google.com/p/tesseract-ocr/downloads/list, unpack it, and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata, depending on your distribution.

If Tesseract isn't available for your distribution, or you want to use a newer version than is available, you can compile your own (cf. http://code.google.com/p/tesseract-ocr/wiki/Compiling).

Note that older versions of Tesseract only supported TIFF input, and their language training data format is incompatible with the one used in 3.0.x.

Mac OS X
The easiest way to install Tesseract is through homebrew (http://brew.sh). Once homebrew is installed, you can install Tesseract by running the command: brew install tesseract.

If you want to use language training data not included with the homebrew package, download the appropriate training data, open it with Finder, and copy the .traineddata file into the /usr/local/Cellar/tesseract/<version>/share/tessdata directory, substituting the installed version number for <version>.

Windows
An installer is available for Windows from the project download page. This includes the English training data.

If you want to use another language, download the appropriate training data, unpack it using 7-zip (http://www.7-zip.org/), and copy the .traineddata file into the 'tessdata' directory of your Tesseract installation.

Other Platforms
Tesseract may work on more exotic platforms too. You can either try compiling it yourself, or take a look at the list of other projects using Tesseract.

System requirements
Tesseract has a small footprint and will run on most recent hardware, even on mobile devices.

Documentation
Most relevant documentation can be found at the project website, http://code.google.com/p/tesseract-ocr/.

Input image formats
According to the manual page, most image file formats (anything readable by the Leptonica image processing library) are supported. The Leptonica project page (http://code.google.com/p/leptonica/) lists at least jpg, png, tiff, bmp, pnm, gif, ps, pdf and webp.

Supported languages
Currently supported languages for version 3.02 are: Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Cherokee, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Finnish, Frankish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Italian (Old), Japanese, Kannada, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Middle English (1100-1500), Middle French (ca. 1400-1600), Norwegian, Polish, Portuguese, Romanian, Serbian (Latin), Slovakian, Slovenian, Spanish, Spanish (Old), Swahili, Swedish, Tagalog, Tamil, Telugu.

Language data can be downloaded at http://code.google.com/p/tesseract-ocr/downloads/list. The uncompressed trained data should be copied to the 'tessdata' directory.

Tesseract 3.02 supports recognition of images containing text in more than one language. Users can specify several languages, separated by plus signs, and Tesseract will choose the most accurate recognition result. Keep in mind that recognition with several language profiles takes much longer than with a single one.
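As an illustrative sketch (the file names below are invented, and actually running the command requires Tesseract plus both language packs to be installed), a mixed English/German page would be recognized like this:

```shell
# Hypothetical multi-language invocation: 'page.png' and 'page-out'
# are made-up names; languages are joined with '+'.
cmd='tesseract page.png page-out -l eng+deu'
echo "$cmd"   # shown here; run the command directly on a real system
```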

Limitations
The fact that your image format is supported and your language is implemented does not necessarily mean that your recognition results will be satisfactory. The main reasons for suboptimal results are


 * Poor quality images, for instance low-resolution black and white images from old microfilms
 * Degraded documents (warped, unclear printing, damaged, …)
 * Font shapes unknown to the engine
 * Your language may be listed as supported, yet the actual language in your documents may still be a poor match for the implemented language support, for instance if it contains specific terminology or historical or regional variants.

Extending language support
One of the peculiarities of Tesseract is that glyph shape training data and language support data are tied together. This means that compiled word lists are part of the trained data bundle. A limited number of words can be added without building a new data package, as a user word list.

Otherwise, one has to retrain the engine (cf. the relevant section). A workaround for the entanglement of language and font data is as follows. Put the trained data file for your language in a separate directory and change into that directory. Assume the trained data file you start from is LANG.traineddata.


 * 1) Unpack the trained data: combine_tessdata -u LANG.traineddata LANG.
 * 2) Compile a word list to DAWG format: wordlist2dawg your_word_list new_dawg_file LANG.unicharset
 * 3) Replace the word dawg: cp new_dawg_file LANG.word-dawg
 * 4) Repack the trained data: combine_tessdata LANG.
 * 5) Install your file LANG.traineddata by copying it to the Tesseract data directory.

Output formats
There are two possible full text output formats: plain text and hOCR. hOCR is an open standard which defines a data format for the representation of OCR output. The standard aims to embed layout, recognition confidence, style and other information into the recognized text itself; this is achieved by embedding the data in standard HTML. Neither format is entirely suitable for deployment in digital libraries, where one typically prefers XML-based solutions. Conversion of hOCR to ALTO, or direct ALTO output, is an obvious desideratum, but no such utility seems to be available.
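To illustrate how hOCR embeds this information, the fragment below shows a hypothetical word element (the attribute layout follows the hOCR convention; the word, coordinates and confidence value are invented) and extracts its bounding box with standard shell tools:

```shell
# A made-up hOCR word element: the title attribute carries the bounding
# box ("bbox left top right bottom") and a recognition confidence.
span='<span class="ocrx_word" title="bbox 28 72 118 104; x_wconf 91">Tesseract</span>'

# Pull the four bbox coordinates out of the title attribute.
bbox=$(printf '%s\n' "$span" | sed -n 's/.*title="bbox \([0-9 ]*\);.*/\1/p')
echo "$bbox"   # prints: 28 72 118 104
```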

Another output format, which is relevant in the training process, is the box format, which gives bounding boxes for each recognized character.
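As a hedged illustration (the character and coordinates below are invented), each line of a Tesseract 3.x box file gives a character followed by the left, bottom, right and top coordinates of its bounding box, plus the page number:

```shell
# A made-up box-format line: character, left, bottom, right, top, page.
line='T 36 92 52 116 0'
set -- $line   # split on whitespace into positional parameters
echo "char=$1 left=$2 bottom=$3 right=$4 top=$5 page=$6"
```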

Running OCR
The manual page is at http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html.

Options
imagename

The name of the input image. Most image file formats (anything readable by Leptonica) are supported.

outbase 

The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt.

-l lang

The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)

-psm N

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are (per the manual page):

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

-v

Prints the current version of the tesseract(1) executable.

configfile 

The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value. Interesting config files include:

hocr - Output in hOCR format instead of as a text file. If this configuration file is not present, you can create and use a plain text file containing the line tessedit_create_hocr T
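Such a fallback config file can be created as follows (the file name 'hocr' matches the config name passed on the command line; the variable name is the one given above):

```shell
# Create a minimal 'hocr' config file enabling hOCR output.
printf 'tessedit_create_hocr T\n' > hocr
cat hocr
# It would then be passed as a trailing config file argument, e.g.:
#   tesseract scan.png out -l eng hocr
```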

Nota Bene: The options -l lang and -psm N must occur before any configfile.

Training Tesseract
Tesseract is retrainable. Documentation on the training process is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. A shell script implementing the training process is available in the appendix.

Though this takes care of the purely technical part of the process, it defines how to compile training data into Tesseract's format rather than how to develop training data that yields optimal recognition results.

The basic requirements for training a font/language combination are:


 * 1) A combination of a (usually black and white) page image and a text file in box format, containing one line per character with the character and the bounding box of an occurrence of that character in the page image.
 * 2) Word lists: a list of frequent words and a broader list aiming at a more comprehensive coverage of the language.
 * 3) Some small configuration files.

From this, command line utilities supplied with the basic Tesseract distribution can build a new trained data bundle. An example script for the process is given in the appendix.

This does not yet give us guidelines on how to proceed when we want to train for a specific collection. Ideally, one might hope that a set of images together with a set of ground truth transcriptions might be enough to train the engine. In practice, this is not so easy.

First, since there is no practical tool available to align the images with a plain text ground truth transcription, we need to enhance our ground truth with character bounding boxes or even bounding polygons. Moreover, there are some restrictions: character bounding boxes should not overlap; there should only be one font per training image/box file pair. More importantly, while one might expect that damaged instances of a glyph shape might also be informative to the character classification process, this appears not to be the case. Tesseract assumes its training material to represent prototypical shapes rather than possibly noisy instances.

Several solutions have been developed to bridge the gap between ground truth data and a Tesseract trained data bundle. User interfaces have been developed to create (or manually correct automatically created) box files, for instance JtessBoxEditor (http://vietocr.sourceforge.net/training.html) or web-based Cutouts (http://wlt.synat.pcss.pl/cutouts).

We also mention two approaches based on the PAGE XML ground truth format (http://www.primaresearch.org/tools.php). This format allows user-friendly development of ground truth material with the option of specifying coordinates for text regions, lines, words and individual glyphs with the Aletheia tool (ibidem), which can be used freely for non-commercial purposes.


 * The Poznan Supercomputing and Networking Center (PSNC) has developed two handy tools to automatically develop training data starting from an image and a PAGE XML ground truth file with glyph coordinate information. The first tool cuts out the glyphs from the image, creating individual images. After this stage, noisy character images may be removed. The second tool recombines the glyphs into a “cleaner” input image which can be used in the Tesseract training process, and also generates the required box file. The use of these tools is documented in the file IC-Tesseracttrainingworkflow-200913-0919-9296.pdf, included in the training package.
 * In the EMOP project, a tool called Franken+ has been produced. Given the binarised image and the XML file generated with Aletheia, Franken+ extracts individual TIFF images for each letter blocked out in Aletheia, giving the user the opportunity to hand-pick the best instances of each letter (thus producing a "font" consisting only of hand-picked images). Using this font, Franken+ can then create synthetic TIFF images of text "printed" in this font, with corresponding box files, which are used to train the Tesseract OCR engine so that it can OCR images of documents printed in the relevant historic font. Franken+ automates the Tesseract font training process on these synthetic images and box files, and allows the user to test the resulting font.

For a comparison between the FineReader and the Tesseract OCR trainability, cf. for instance the case study http://lib.psnc.pl/dlibra/doccontent?id=358, which we include with the current SUCCEED training materials for Tesseract.

Building a Tesseract trained data bundle
This bash script assumes the presence, in the current directory, of


 * 1) A file ‘files.lst’ containing the base names of the images and box files (such that the list contains, for instance, the line “image1” when image1.png and image1.box are the respective image and box files)
 * 2) For each line li in this file containing a string si, an image file si.$EXTENSION and a box file si.box.
 * 3) A word list named words.list
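As a small hedged helper (the touch lines merely create empty placeholder files so the example is self-contained; on a real training set they would be your actual image and box files), such a files.lst can be generated from the images in the current directory:

```shell
# Create placeholder image/box pairs purely for demonstration.
EXTENSION=png
touch image1.$EXTENSION image1.box image2.$EXTENSION image2.box

# Write the base name of every image file to files.lst.
for f in *.$EXTENSION; do
  printf '%s\n' "${f%.$EXTENSION}"
done > files.lst

cat files.lst
```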