This project generates an edition-specific OCR training file for Kraken, targeting Theodorus Gaza’s Attic paraphrase of the Iliad. The facing pages of this edition carry the Iliad text, printed in the same font as the paraphrase, so we can use them to generate ground truth quickly. That ground truth can then (it is hoped) be used to train a model that accurately OCRs the Attic paraphrase.
Instructions
- Pick a non-transcribed page (❌) from the list below (you might also check that there are no open pull requests for your page)
- Feel free to open a provisional pull request with the page you’re working on (e.g. gaza_1_page_00046), if you want to avoid any potential duplication of effort. Simply close the pull request if you abandon the work.
- Copy/paste the corresponding lines from UChicago Perseus
- Read each line in the image and correct the transcription so that it reflects the diplomatic ground truth, i.e. exactly what is printed in the image
- When done with a page, click “Download” and make a Pull Request with the output
Notes
- If a chunk is incorrectly segmented (multiple lines lumped together, or a single line cut in half), simply skip it
- The beginning of each line is usually capitalized
- Pay close attention to punctuation, accents, capitalization, and spacing
- This edition uses stigma for “στ”: ϛ
- There are also some “ου” ligatures: ȣ
- Iliad book numbers are referred to by capital Greek letters: Α = 1, Η = 7, Ν = 13, Υ = 20
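The book-letter and stigma conventions above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the project's tooling: the names `book_number` and `stigma_first_pass` are hypothetical, and the stigma replacement assumes that every lowercase “στ” is printed as ϛ, which still has to be checked against the image line by line. The “ου” ligature is deliberately left alone, since only some occurrences use it.

```python
# Hypothetical helpers for the conventions listed in the notes above.

# The 24 capital letters of the Greek alphabet, Α (U+0391) through Ω (U+03A9).
# U+03A2 is an unassigned code point in that range, so it is skipped.
GREEK_CAPITALS = "".join(
    chr(code) for code in range(0x0391, 0x03AA) if code != 0x03A2
)


def book_number(letter: str) -> int:
    """Return the Iliad book number (1-24) for a capital Greek letter."""
    if len(letter) != 1 or letter not in GREEK_CAPITALS:
        raise ValueError(f"not a capital Greek letter: {letter!r}")
    return GREEK_CAPITALS.index(letter) + 1


def stigma_first_pass(line: str) -> str:
    """Replace lowercase sigma + tau with stigma (U+03DB) as a rough first pass.

    Assumption: the edition prints every lowercase "στ" as stigma. A capital
    sigma followed by tau is not touched. Verify each result against the image.
    """
    return line.replace("\u03c3\u03c4", "\u03db")


if __name__ == "__main__":
    # Print the full letter-to-book table for reference.
    for letter in GREEK_CAPITALS:
        print(letter, book_number(letter))
```

For instance, `book_number("Η")` gives 7 and `book_number("Υ")` gives 20, matching the examples in the note. Run `stigma_first_pass` on lines pasted from Perseus before correcting them by hand, not as a substitute for reading the image.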
Iliad Pages
Paraphrase Pages
Transcribing some additional pages of the paraphrase itself may be more time-consuming, but will likely improve the generalization of the OCR training to the rest of the paraphrase pages: