tags: ocr

The HTRC Extracted Features Dataset is a valuable resource for anyone interested in doing large-scale text analysis. Because of my work in Latin OCR, Latin volumes in HathiTrust are of particular interest to my research. Selecting a Latin volume that I knew of from data I was already working with, I noticed that the page-level “language” metadata was pretty bad, frequently detecting Portuguese as the page language when the majority of OCR tokens were very recognizably Latin. It seems like the language detection library used by HTRC isn’t trained against Latin, so I thought it might be useful to re-process page tokens with langid.

I noticed that for the Latin volume I selected, while the page-level language metadata was wrong, the volume-level language metadata was correct. Since I’m mostly interested in Latin volumes (and didn’t want to find 1.2TB of free space for the full set of basic features on all volumes), I decided to use the language specified in the bibliographic metadata as an initial filtering criterion. I did this by downloading the bulk bibliographic data export HathiTrust makes available, and filtering to Latin with:

awk -F $'\t' 'BEGIN {OFS = FS} { if ($19 == "lat")  print }' hathi_full_20160301.txt > hathi_latin_ids.tsv
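For anyone who would rather not count awk fields, the same filter can be sketched in Python. This is an illustrative equivalent rather than part of the original workflow: the function name filter_by_language is made up here, and the only detail taken from the awk command is that the language code sits in the 19th tab-separated field.

```python
LANGUAGE_COLUMN = 18  # awk's $19, counted from zero


def filter_by_language(lines, language="lat", column=LANGUAGE_COLUMN):
    """Yield the tab-separated lines whose language field equals `language`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Short or malformed rows are skipped rather than raising an error.
        if len(fields) > column and fields[column] == language:
            yield line


# Example use, with the file names from the awk command:
# with open("hathi_full_20160301.txt", encoding="utf-8") as src, \
#         open("hathi_latin_ids.tsv", "w", encoding="utf-8") as dst:
#     dst.writelines(filter_by_language(src))
```

Like the awk one-liner, this does a plain split on tabs and applies no quoting rules.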

I then downloaded the list of HTRC basic feature files, and filtered it down to just the Latin volumes I got from the bibliographic data with:

cut -f1,1 < hathi_latin_ids.tsv > latin_ids_only.txt
grep -F -f latin_ids_only.txt pd-basic-file-listing.txt > pd-basic-latin.txt
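One caveat with grep -F is that it matches fixed strings anywhere in the line, so an ID that happens to be a prefix of a longer ID matches both. A stricter variant can be sketched in Python by comparing whole IDs. Everything specific here is an assumption to check against the real listing: that each path ends in a file name of the form ID.basic.json.bz2, and that IDs in file names are pairtree-style "cleaned" (":" becomes "+", "/" becomes "=", and "." after the namespace prefix becomes ","). The helper names are hypothetical.

```python
import posixpath

SUFFIX = ".basic.json.bz2"  # assumed file-name suffix in the listing


def clean_id(volume_id):
    """Apply pairtree-style cleaning to the part after the namespace prefix.

    This mapping is an assumption about how IDs appear in file names.
    """
    namespace, _, local = volume_id.partition(".")
    local = local.replace(":", "+").replace("/", "=").replace(".", ",")
    return f"{namespace}.{local}"


def filter_listing(paths, volume_ids, suffix=SUFFIX):
    """Yield listing lines whose file name corresponds exactly to a wanted ID."""
    wanted = {clean_id(volume_id) for volume_id in volume_ids}
    for path in paths:
        name = posixpath.basename(path.strip())
        if name.endswith(suffix) and name[: -len(suffix)] in wanted:
            yield path
```

If the listing turns out to use raw IDs instead, dropping the clean_id step leaves an exact-match filter.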

I could then rsync all these feature files with:[1]

rsync -av --files-from pd-basic-latin.txt data.sharc.hathitrust.org::pd-features/ latin/

This gave me about 53GB of compressed JSON. I then wrote a short, dumb script (sharclangcountbz2.py) to mash strings generated from the page tokens into langid and write the results to CSV, which I then ran with:

find ../latin -name '*.bz2' | parallel -u -j8 -X ../sharclangcountbz2.py
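A rough sketch of what a script along these lines could look like follows. It is not the original sharclangcountbz2.py. It assumes each feature file is bzip2-compressed JSON with pages under features.pages, and per-page body token counts under body.tokenPosCount (token, then part of speech, then count). The CSV layout is also an assumption, chosen only so that the language code is the first field; the function names and the langid/ output directory are illustrative.

```python
import bz2
import csv
import json
import os
import sys


def page_text(page):
    """Rebuild a bag-of-words string from one page's body token counts."""
    tokens = (page.get("body") or {}).get("tokenPosCount") or {}
    words = []
    for token, pos_counts in tokens.items():
        # Repeat each token once per occurrence, summed over its POS tags.
        words.extend([token] * sum(pos_counts.values()))
    return " ".join(words)


def classify_volume(path, classify, out_dir="langid"):
    """Write one CSV row (language, score, page sequence) per page."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        volume = json.load(handle)
    stem = os.path.basename(path)
    for suffix in (".bz2", ".json"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, stem + ".csv")
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for page in volume["features"]["pages"]:
            language, score = classify(page_text(page))
            writer.writerow([language, score, page.get("seq", "")])
    return out_path


def main(paths):
    import langid  # third-party package, imported only when actually run

    for path in paths:
        classify_volume(path, langid.classify)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Taking the classifier as a parameter keeps the file handling testable without langid, and accepting several paths per invocation fits the way parallel -X batches arguments.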

For anyone interested, here’s a link to a .tar.bz2 of the resulting CSV files (17.6MB).

We can then do things like find the majority page language for each volume and then see what the language distribution is:

find langid -name '*.csv' | while IFS= read -r csvfile; do echo "$(basename "${csvfile}" .csv),$(cut -d, -f1,1 "${csvfile}"|sort|uniq -c | sort -rn|head -1|sed -E 's/^ *[0-9]+ //g')"; done > toplangs.csv
cut -d, -f2,2 toplangs.csv|sort|uniq -c|sort -rn > lang-distribution.txt

Or do the same thing at the page level:

find langid -name '*.csv' -print0 | xargs -0 cat | cut -d, -f1,1 | sort | uniq -c | sort -rn > page-lang-distribution.txt
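The two shell pipelines above can also be sketched in Python, which may be easier to extend. The only assumption carried over from the shell commands is that each CSV has the language code as its first field and is named after its volume; the function names are illustrative.

```python
import csv
import glob
import os
from collections import Counter


def page_languages(csv_path):
    """Return a Counter of first-column language codes for one volume CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[0] for row in csv.reader(handle) if row)


def language_distributions(csv_dir="langid"):
    """Return (top language per volume, volume-level counts, page-level counts)."""
    top_by_volume = {}
    page_counts = Counter()
    pattern = os.path.join(csv_dir, "**", "*.csv")
    for path in sorted(glob.glob(pattern, recursive=True)):
        counts = page_languages(path)
        page_counts.update(counts)
        if counts:
            # On a tie, most_common keeps the language seen first in the file,
            # which may differ from the tie-breaking of the sort pipeline.
            volume = os.path.basename(path)[: -len(".csv")]
            top_by_volume[volume] = counts.most_common(1)[0][0]
    return top_by_volume, Counter(top_by_volume.values()), page_counts
```

Calling most_common() on either returned Counter gives the same descending order as the sort -rn step.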

Note that the results still depend on the tokens we get from the OCR. Since we don’t restrict the language set used by langid, we wind up with some odd things like 7,512 volumes detected as “Luxembourgish”. Spot-checking these, it seems like this (and other obscure languages) often winds up being a kind of proxy for “bad OCR”.
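For comparison, langid can be limited to a subset of its languages. Below is a minimal sketch, assuming the langid package is installed. The candidate list is a placeholder for illustration, not a recommendation, and restricted_classifier is a hypothetical helper name.

```python
def restricted_classifier(languages):
    """Return a classify(text) function limited to the given language codes."""
    # Imported lazily so this module still loads where langid is not installed.
    from langid.langid import LanguageIdentifier, model

    # norm_probs=True makes the second return value a probability in [0, 1].
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    identifier.set_languages(languages)
    return identifier.classify


# Placeholder candidate set for illustration only.
CANDIDATES = ["la", "en", "fr", "de", "it", "es", "pt", "el"]

# Example use:
# classify = restricted_classifier(CANDIDATES)
# language, probability = classify("arma virumque cano")
```

The trade-off is that a restricted set forces badly OCRed pages into one of the candidates, so the "obscure language" signal disappears. Keeping the normalized probability alongside the label is one way to retain a rough confidence indicator.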

I’d also like to run this processing against the full feature dataset and release the results, though this may take a while longer and I’d like to incorporate any feedback I get from this work into that process. Watch this space!


  1. I strongly suggest checking that you’re using rsync 3.x if you’re going to do this sync yourself.