OCRopus - open source document analysis and OCR system (www.ocropus.org)

Version 0.2


This file contains information for building OCRopus on a Linux system.
For differences on other platforms, please have a look at:
http://groups.google.com/group/ocropus/web

Throughout the file, we assume that `sudo' is used to get root privileges; 
adapt those commands if you use a different method.


--------------------------------------------------------------------------------
Contents
--------------------------------------------------------------------------------

    * Requirements
    * Building OCRopus
    * Optional Software
    * Building Python Extension


--------------------------------------------------------------------------------
Requirements
--------------------------------------------------------------------------------

The following software needs to be installed for compiling and running OCRopus:
    * jam           (Perforce or ftjam, not bjam!)
    * libpng-dev    (or equivalent)
    * libjpeg-dev   (or equivalent)
    * libtiff-dev   (or equivalent)
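
As a rough pre-flight check, the sketch below reports whether `jam' and the
image library headers can be found. The helper names (have_cmd, have_header)
and the include directories are assumptions made for this example, not part of
OCRopus; some distributions keep headers elsewhere, so a MISSING line is a hint
rather than proof.

```shell
# Report whether a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether a header exists in one of the usual include directories.
# The directory list is an assumption; adjust it for your distribution.
have_header() {
    for dir in /usr/include /usr/local/include; do
        if [ -f "$dir/$1" ]; then
            return 0
        fi
    done
    return 1
}

if have_cmd jam; then echo "jam: found"; else echo "jam: MISSING"; fi

# png.h, jpeglib.h and tiff.h are the main headers of libpng, libjpeg, libtiff.
for header in png.h jpeglib.h tiff.h; do
    if have_header "$header"; then
        echo "$header: found"
    else
        echo "$header: MISSING"
    fi
done
```

Note that finding a `jam' binary does not tell you which jam variant it is.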

It is a good idea to install Tesseract as well, although it's technically
possible to compile OCRopus without it.

The 2.03 release of Tesseract has a bug. We provide a patch for it, called
tesseract-2.03-patch.diff, located in the top-level OCRopus directory.
The commands to install Tesseract 2.03 might look like this:

    wget http://tesseract-ocr.googlecode.com/files/tesseract-2.03.tar.gz
    tar xzf tesseract-2.03.tar.gz
    cd tesseract-2.03
    wget http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
    tar xzf tesseract-2.00.eng.tar.gz             # or other language packages
    patch -p1 <../ocropus-0.2/tesseract-2.03-patch.diff     # check this path!
    ./configure    # CXXFLAGS="-fPIC -O2" ./configure if you want Python later
    make
    sudo make install                                 # installs in /usr/local

The installation will finish with an error message about having no install
target in the java/ subdirectory. That's another bug in 2.03 - just ignore it.

Alternatively, you can use SVN:

    svn co http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
    cd tesseract-ocr
    ./configure    # CXXFLAGS="-fPIC -O2" ./configure if you want Python later
    make
    sudo make install                                # installs in /usr/local

After Tesseract is installed, check it by typing

    tesseract phototest.tif out
    cat out.txt

If out.txt starts with "This is a lot of 12 point text..." then Tesseract works.
If it doesn't work for you, we most likely can't help, but the developers of
Tesseract (http://code.google.com/p/tesseract-ocr/) most likely can.
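
The same check can be scripted. The sketch below is only an illustration: the
function name starts_with_expected is made up for this example, and it assumes
that phototest.tif is in the current directory.

```shell
# Succeed if the first line of the given file starts with the expected text.
starts_with_expected() {
    head -n 1 "$1" | grep -q '^This is a lot of 12 point text'
}

# Only run the real check when Tesseract and the sample image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f phototest.tif ]; then
    tesseract phototest.tif out
    if starts_with_expected out.txt; then
        echo "Tesseract works"
    else
        echo "unexpected output in out.txt" >&2
    fi
fi
```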


--------------------------------------------------------------------------------
Building OCRopus
--------------------------------------------------------------------------------

After installing the required software (see above), go to the OCRopus release
directory and run:
    ./configure    # CXXFLAGS="-fPIC -O2" ./configure if you want Python later
    jam             # do not use boost jam!

You can find more information about jam (including how to get it) here:
http://freetype.sourceforge.net/jam/index.html
Many Linux distributions also provide `jam' or `ftjam' packages that can be
installed in the usual way, for example, "apt-get install ftjam" on Ubuntu.

You can adjust the build process with the following options to ./configure:
  --with-tesseract=...    path to tesseract if not installed in /usr/local/.
                          Should be the same directory as was given to
                          tesseract's configure script through --prefix=
  --without-fst           disable OpenFST language modelling
  --without-aspell        disable aspell as dictionary
  --without-SDL           disable SDL (graphical debugging for ocroscript)
  --with-leptonica        enable Lua bindings for Leptonica


--------------------------------------------------------------------------------
Optional Software
--------------------------------------------------------------------------------

This software is not used in OCRopus's pass-all-to-Tesseract mode,
but can be useful for experimentation through Lua.

For the interactive mode of ocroscript, install:
    * libedit-dev

A library for FST handling, used in a few scripts:
    * OpenFST (http://www.openfst.org)

Spellchecking (necessary for 2-pass adaptation, which is not the default mode):
    * libaspell-dev (or equivalent)
    * aspell-en     (or equivalent)

Without aspell, OCRopus needs a UTF-8 encoded word list instead.
By default, it will look into /usr/share/dict/words; you can supply
a different location through the "wordlist" environment variable.
The file `data/words/en-us' is a suitable word list.
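
One possible way to set the variable is sketched below. It is an illustration
only: the helper name pick_wordlist is made up, and the relative path assumes
that you run it from the OCRopus release directory.

```shell
# Print the first readable file among the given candidate paths.
pick_wordlist() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Try the system word list first, then the file named in the text above.
if wordlist=$(pick_wordlist /usr/share/dict/words data/words/en-us); then
    export wordlist
    echo "using word list: $wordlist"
else
    echo "no word list found" >&2
fi
```

Programs started from the same shell afterwards will see the exported variable.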

For advanced graphical debugging, install the latest versions of:
    * libsdl-dev, libsdl-gfx-dev, libsdl-image-dev

A nice image handling library that we have bindings to:
    * leptonica


--------------------------------------------------------------------------------
Building Python Extension
--------------------------------------------------------------------------------

First of all, both Tesseract and OCRopus should be configured like this:
    CXXFLAGS="-fPIC -O2" ./configure

That should do the trick on a "standard" Linux. Then go to the python-binding/
subdirectory in the OCRopus release directory and use the usual Python way
of handling packages, which is
    python setup.py build
    python setup.py install

If it worked, you'll be able to "import ocropus" from Python.
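
A quick way to test this from the shell is sketched below. The function name
can_import and the PYTHON override variable are assumptions made for this
example; by default the sketch calls whatever `python' is on your PATH.

```shell
# Succeed if the given Python module can be imported.
# PYTHON is a hypothetical override for the interpreter name.
can_import() {
    "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

if can_import ocropus; then
    echo "ocropus module: importable"
else
    echo "ocropus module: NOT importable" >&2
fi
```

If the import fails, check that the build and install steps used the same
Python interpreter as the one tested here.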
