Originally Published: 2014-11-19
A pull request I submitted to Homebrew to add a
--with-training-tools option to the
tesseract formula has now been accepted, so you should be able to just do
brew install --with-training-tools tesseract. Please submit any issues with the training tools under OS X to the Tesseract project on GitHub.
In my previous post I outlined getting Tesseract working for OCR of PDFs on OS X. In this post, I’d like to document how to install and use the Tesseract training tools.
My first crude attempt at getting the training tools built and installed was simply adding the necessary
make commands to the Homebrew formula and reinstalling with
--devel. However, this caused some bizarre problems in even getting the
text2image command to run: 1
$ text2image --list_available_fonts
Error resolving name for ScrollView host --list_available_fonts:8461
Segmentation fault: 11
$ text2image localhost --list_available_fonts
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -Djava.library.path=. -cp ./ScrollView.jar:./piccolo2d-core-3.0.jar:./piccolo2d-extras-3.0.jar com.google.scrollview.ScrollView & wait"
ScrollView: Waiting for server...
Error: Could not find or load main class com.google.scrollview.ScrollView
sh: line 0: kill: %1: no such job
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
^C
I decided to try building against
--HEAD but got some link errors during the training build. After some more thorough hacking of the formula, I got something that built, linked, and apparently worked. You can see the formula here.
After running brew uninstall tesseract to remove any existing install, you can build and install my version of the formula with:
brew install --training-tools --all-languages --HEAD https://raw.githubusercontent.com/ryanfb/homebrew/tesseract_training/Library/Formula/tesseract.rb
You should now be able to do e.g.:
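One way to sanity-check the result, sketched here under the assumption that the build placed the training binaries on your PATH: confirm that the tools resolve, then ask text2image for its font list. The tool names other than text2image are my assumption about what a training build installs, and check_training_tools is a hypothetical helper, not part of Tesseract or Homebrew.

```shell
# Hypothetical helper: report which of the named commands are on the PATH.
# Returns non-zero if any of them is missing.
check_training_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed tool names; adjust to whatever your build actually installed.
if check_training_tools text2image unicharset_extractor mftraining cntraining; then
  # The same invocation that crashed with the earlier --devel build.
  text2image --list_available_fonts
else
  echo "some training tools are missing; check the brew install output"
fi
```

If every tool is reported as found and the font listing prints without a ScrollView error or a segfault, the training build is probably usable.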
fontconfig font locations and caching are a whole other nightmare, and I seem unable to get
text2image to respect/use the
--fonts_dir argument on OS X. Your best bet seems to be to install things as system/user fonts (e.g. copy into
~/Library/Fonts) and optionally run
fc-cache -frv to force a cache update.
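The workaround above can be sketched as a small shell function. This is only an illustration: install_user_font and the FONT_DIR variable are hypothetical names of mine, and the fc-cache step is skipped when fontconfig is not on the PATH.

```shell
# Per-user font directory on OS X; override FONT_DIR to install elsewhere.
FONT_DIR="${FONT_DIR:-$HOME/Library/Fonts}"

# Hypothetical helper: copy one font file into FONT_DIR, then force a
# fontconfig cache rebuild so the new font can be picked up.
install_user_font() {
  font_file="$1"
  if [ ! -f "$font_file" ]; then
    echo "no such font file: $font_file" >&2
    return 1
  fi
  mkdir -p "$FONT_DIR" && cp "$font_file" "$FONT_DIR/" || return 1
  # Only refresh the cache if fc-cache is available; a failure here is
  # reported but does not undo the copy.
  if command -v fc-cache >/dev/null 2>&1; then
    fc-cache -frv >/dev/null 2>&1 || echo "fc-cache failed; continuing" >&2
  fi
  echo "installed $(basename "$font_file") into $FONT_DIR"
}
```

Usage would look like install_user_font ~/Downloads/Example.ttf (a made-up path), followed by text2image --list_available_fonts to see whether the font now shows up.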
Googling these problems led to this unresolved thread on the tesseract-ocr mailing list, from which I’ve actually copied the error example, as mine is now lost to the great scrollbuffer in the sky.