A correspondent has sent me a very interesting message from a Bruce Robertson, taken from the Digital Classicists list, which I think might interest people here.
Federico and I have been working quite a bit on Greek OCR this past year, and have made some advances since the publications below. We now have a process on Compute Canada servers using my Gamera-based ‘Rigaudon’ code
https://github.com/brobertson/rigaudon
This process undertakes OCR at multiple levels of darkness, uses a weighted Levenshtein distance correction system that I’ve worked on, and when possible it combines Greek and Latin-script OCR to produce a good mixed result, preserving information in the app. crit.
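For readers unfamiliar with the idea, a weighted Levenshtein correction can be sketched roughly as follows. This is only an illustration in Python, not Rigaudon's actual code: the function names, the confusion weights, the threshold and the tiny word list are all assumptions invented for the example. The point is that substituting two characters the classifier often confuses costs much less than an arbitrary substitution, so only "cheap" and unambiguous corrections are applied.

```python
def weighted_levenshtein(ocr_word, candidate, sub_cost=None, indel_cost=1.0):
    """Edit distance in which some character substitutions are cheaper.

    sub_cost maps (ocr_char, intended_char) to a cost; pairs that are not
    listed cost 1.0, as in the ordinary Levenshtein distance.
    """
    sub_cost = sub_cost or {}
    prev = [j * indel_cost for j in range(len(candidate) + 1)]
    for i, ca in enumerate(ocr_word, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(candidate, start=1):
            cost = 0.0 if ca == cb else sub_cost.get((ca, cb), 1.0)
            curr.append(min(prev[j] + indel_cost,      # drop a character
                            curr[j - 1] + indel_cost,  # add a character
                            prev[j - 1] + cost))       # substitute (or match)
        prev = curr
    return prev[-1]


def correct(word, dictionary, sub_cost, max_distance=0.5):
    """Replace word only if exactly one dictionary entry is very close."""
    if word in dictionary:
        return word
    scored = sorted((weighted_levenshtein(word, cand, sub_cost), cand)
                    for cand in dictionary)
    if not scored:
        return word
    best_distance, best_word = scored[0]
    unique = len(scored) == 1 or scored[1][0] > best_distance
    return best_word if best_distance <= max_distance and unique else word


# Hypothetical weights: this imaginary classifier often reads nu as upsilon.
CONFUSIONS = {("υ", "ν"): 0.2}
LEXICON = {"νόμος", "λόγος"}

if __name__ == "__main__":
    print(correct("υόμος", LEXICON, CONFUSIONS))
```

Keeping the threshold low is what makes such a pass "light": a word is left alone unless a single dictionary form is reachable through known confusions.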
This group is probably most interested in looking over the results.
Here’s a typical volume in Teubner serif font, which took about 2h to run on our 40 cores (all results are pure machine output, without manual spellcheck or other human intervention):
Here’s a rather challenging papyrological text:
We also have Teubner sans font texts working:
And the more challenging Didot foundry:
And of course, Oxford:
If you’re into bleeding-edge experiments, here’s some Migne:
There are many more, along with some experiments (successful or otherwise) at:
http://heml.mta.ca/Rigaudon/Views/SideBySide/
I have set up a public spreadsheet for OCR requests from archive.org volumes, here:
https://docs.google.com/spreadsheet/ccc?key=0ArJt01185Q8mdERsS2VRMngtTWRNUDAtNGFxZXhFQVE&usp=sharing
and I’d be delighted if anyone on the list wanted to add a request, or just email me with one. Output will be in standard HOCR or plain text.
Currently, I’m working on a classifier for Migne, which is a very challenging but potentially quite useful series of volumes. We’re also working on implementing an idea Federico had quite a while ago, aligning the output of multiple engines or runs to improve the overall output. This would allow one to add the best of Nick White’s recent important work on Tesseract to the output of Rigaudon, for instance.
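The idea of aligning the output of two engines can be illustrated with a small Python sketch. It is not the method Federico and Bruce are implementing, only one possible reading of it, under stated assumptions: both outputs are already split into words, the alignment is done with the standard library's difflib, and disagreements are settled by a simple dictionary lookup. The function name and the sample words are hypothetical.

```python
from difflib import SequenceMatcher


def merge_ocr_outputs(tokens_a, tokens_b, lexicon):
    """Align two word lists and keep, at each disagreement, the better side.

    "Better" here simply means "contains more words found in lexicon";
    ties are resolved in favour of the first engine.
    """
    merged = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        span_a, span_b = tokens_a[i1:i2], tokens_b[j1:j2]
        if tag == "equal":
            merged.extend(span_a)  # both engines agree
            continue
        score_a = sum(word in lexicon for word in span_a)
        score_b = sum(word in lexicon for word in span_b)
        merged.extend(span_b if score_b > score_a else span_a)
    return merged


if __name__ == "__main__":
    lexicon = {"ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"}
    engine_a = ["ἐν", "ἀρχῇ", "ἦυ", "ὁ", "λόγος"]   # one word misread
    engine_b = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγοc"]   # a different word misread
    print(" ".join(merge_ocr_outputs(engine_a, engine_b, lexicon)))
```

A real system would presumably weigh the engines' confidence scores as well; the dictionary vote is just the simplest rule that shows why two imperfect outputs can yield a better combined one.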
The code is a script written in the Python language. The code requires that you first install Gamera (also written in Python). I believe Python can run on Windows as well as on Unix.
If I had any time, I’d be interested to find out how well this runs. But a caveat: when I looked at the home page, I saw the dreaded word “training”, meaning that the code has to be taught to recognise characters. I suspect this stuff is not mature enough for normal people.
All the same, this is excellent work!
Dear Roger,
May God reward your efforts! It is a really important step towards the most necessary text for the whole of Christianity.
I see some important things to be done:
1. an automatic proof reader, based on a defined dictionary (there are such dictionaries on the internet; I have one of them for Classical Greek, to be used in Open Office). This automatic proof reader could also use free morphological engines – like those used by Diogenes (made many years ago by P. Heslin) and the team behind Kalos – whenever an unknown term appears. If the machine doesn’t know what to do, it may ask you what to do next.
2. a trainer to learn new typesets, like those with ligatures (a lot of great books were printed this way and never re-edited).
3. an end user package, so that anyone could install it on his own machine and run it.
As you already said: Great work!
Please keep us informed of the progress.
monk Filotheus
All good ideas. One day, let’s hope, Greek OCR will be as common as English!
Roger:
Thanks for posting about our project. Rigaudon is meant to run in a massively parallel environment, especially HPCs, though I do test it on a computer with 8 cores, and that machine can render this in 1/2h:
http://heml.mta.ca/Rigaudon/Views/SideBySide/septemadthebased00aescuoft_2013-03-26-18-11_Kaibel_Round_4_No_Latin_340_sidebyside/
Our goal is to assist large-scale digitization projects by providing very high quality output, requiring only a small amount of editing. It is hoped that a good sample of multiple editions of all ancient Greek can be rendered into a freely-licensed TEI-XML form within the next few years. We are also collaborating with researchers who are interested in extracting Greek quotations from secondary works.
If you’re interested in desktop-quality OCR of polytonic Greek, check out the latest builds of Google’s Tesseract (3.02.02), which Nick White of Durham University has hacked to provide really good Greek output.
Federico and I will be integrating tesseract into the Rigaudon system, using at least these two OCR engines to raise the level of accuracy and certainty.
Filotheus —
You’ve got some great ideas!
1. Our system does incorporate two levels of spell-check. The first, for which I’m responsible, is meant to be a very ‘light’, automatic process, unlikely to make any incorrect replacements. It’s based on a weighted Levenshtein algorithm, with the weightings tuned to the peculiarities of the particular classifier and font.
My colleague Federico Boschetti of the CNR, Pisa, (who wrote the ancient Greek aspell plugin) has written an OCR editing environment which incorporates spellcheck and alignment with a known text, if such a thing exists. Currently, it’s demoed at:
http://cophidev.ilc.cnr.it:8080/CoPhiProofReader-1.0-SNAPSHOT/
(Select a page further down to see what’s going on.)
Both spellchecks are based on the Perseus morpheus engine, using flat files generated by it.
2. Rigaudon is built upon Gamera, which has an assisted training environment. The fonts you describe are sometimes quite difficult, though, because the components are very connected. I did a wide study of existing fonts and OCR two years ago, and it is published at:
http://heml.mta.ca/RobertsonGreekOCR/
3. Rigaudon is available on github, and can be installed, but I think it is more likely that individual users will enjoy Nick White’s work on Tesseract, esp. if people keep working away on training sets for that.
The major challenge is to OCR Migne to the level that it is affordable to edit it up to a high-quality text. This will take both very good images (600 dpi minimum) and every trick we can bring to bear.
Thank you so much for adding these comments, which are very illuminating! I had not heard of Tesseract, I admit.
Thank you all for your work in this field. I look forward to the day when these tools become available to those of us with merely intermediate technical abilities.
Yes, Tesseract is probably what you want. The Ancient Greek training I made for it is pretty good, there are several graphical interfaces available, it’s available for Linux, Mac and Windows, and it’s free software.
The website for tesseract is http://code.google.com/p/tesseract-ocr/
Download Tesseract and the Ancient Greek language data, and if you want a graphical interface there is a list at http://code.google.com/p/tesseract-ocr/wiki/3rdParty#GUI (I’d probably recommend gImageReader).
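For anyone who would rather script Tesseract than use a graphical interface, here is a minimal Python sketch that calls the command-line program. It assumes that the tesseract executable is on your PATH and that the Ancient Greek data is installed under the language code grc; the file names are hypothetical. Adding the word hocr at the end of the command asks Tesseract for HOCR output instead of plain text.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="grc", hocr=False):
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG [hocr]."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        cmd.append("hocr")  # config name that switches the output to HOCR
    return cmd


def run_tesseract(image_path, output_base, lang="grc", hocr=False):
    """Run Tesseract and raise an error if it exits unsuccessfully."""
    cmd = tesseract_command(image_path, output_base, lang=lang, hocr=hocr)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names: reads page.png and writes page.txt
    print(" ".join(tesseract_command("page.png", "page")))
```

Calling run_tesseract("page.png", "page") would then perform the recognition itself; the sketch only prints the command so that it can be checked first.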
Thank you, Nick – that is really helpful. I shall give it a go!