I’m working away on this Ethiopian homily of John, bishop of Axum, on St. Garima. It was printed in 1898 by C. Conti Rossini,[1] but without translation.
Well, I don’t know any Ethiopian at all, and I don’t even know the alphabet. There are 31 consonants, each of which has seven variants, I gather.
But I knew that it was possible to get Google to turn images into electronic text, and a couple of experiments with ChatGPT and DeepSeek quickly showed that they could make sense of the resulting output file and produce English text from it.
So I need to get a decent electronic text.
My first step was to take the PDF, extract the pages with the Ethiopian text on them, and pull them into Finereader. Finereader does NOT support Amharic, but it has useful image editing tools. I trimmed the 24 pages down to the bare text, with no footnotes or headings, and exported them as images to a directory.
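For anyone who would rather script that page-extraction step than do it by hand, something like the following sketch would probably serve. It assumes the PyMuPDF library, and the file name and page numbers are just placeholders of mine; it also doesn't reproduce the cropping that I did in Finereader.

```python
# Sketch: render selected pages of a PDF out as images.
# "conti_rossini.pdf" and the page range are placeholders, not my actual files.
import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("conti_rossini.pdf")

# Hypothetical range covering the 24 pages of Ethiopic text.
ethiopic_pages = range(10, 34)

for i in ethiopic_pages:
    # Render the page at 300 dpi and save it as a PNG in the working directory.
    pix = doc[i].get_pixmap(dpi=300)
    pix.save(f"page_{i:03d}.png")
```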
I then bundled these images up into a PDF using my incredibly elderly Adobe Acrobat Pro 9.0 and uploaded it to Google Drive. Then I right-clicked on it in Google Drive and opened it in Google Docs. This caused Google to OCR it, thereby creating an electronic text, which I then downloaded in Word format.
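The same upload-and-OCR step can apparently also be driven through the Google Drive API, for anyone wanting to do this at scale rather than by right-clicking in the browser. Here is a rough sketch, assuming the google-api-python-client package; the file names and the OAuth token file are placeholders, not anything from my own workflow. The trick is that asking Drive to store the PDF as a Google Doc is what triggers the OCR.

```python
# Sketch: upload a PDF to Google Drive, let the conversion to a Google Doc
# run OCR on it, then pull the result back down as a .docx file.
# "token.json", "garima_pages.pdf" and "garima_ocr" are all placeholder names.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# An OAuth token obtained beforehand; setting that up is not shown here.
creds = Credentials.from_authorized_user_file(
    "token.json", ["https://www.googleapis.com/auth/drive"]
)
drive = build("drive", "v3", credentials=creds)

# Upload the PDF and ask Drive to store it as a Google Doc -- this is the
# conversion that makes Google OCR the page images.
uploaded = drive.files().create(
    body={"name": "garima_ocr",
          "mimeType": "application/vnd.google-apps.document"},
    media_body=MediaFileUpload("garima_pages.pdf", mimetype="application/pdf"),
    fields="id",
).execute()

# Export the OCRed document in Word (.docx) format and save it locally.
docx_bytes = drive.files().export(
    fileId=uploaded["id"],
    mimeType="application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
).execute()

with open("garima_ocr.docx", "wb") as out:
    out.write(docx_bytes)
```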
I’ve checked the results into a local Git repository – so that I can always go back if I screw up the file.
And now, page by page, I am going through what Google has given me, removing obvious crud and irrelevant line breaks. The OCR seems to insert a small amount of garbage between pages.
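A script could probably do a first pass of this clean-up. Here is a rough sketch of the idea, assuming the OCR output has been saved out as plain UTF-8 text; the file names are placeholders, and in practice I am doing the job by eye, page by page.

```python
# Sketch: strip non-Ethiopic crud from the OCR output and join up the lines.
# "garima_raw.txt" and "garima_clean.txt" are placeholder file names.
import re

raw = open("garima_raw.txt", encoding="utf-8").read()

# The Ethiopic script lives in the Unicode block U+1200-U+137F, which also
# contains the word separator (U+1361) and full stop (U+1362). Anything
# outside that range -- stray Latin letters, digits, OCR debris -- is dropped.
cleaned_lines = []
for line in raw.splitlines():
    kept = "".join(ch for ch in line if "\u1200" <= ch <= "\u137f" or ch.isspace())
    if kept.strip():                      # skip lines that were pure garbage
        cleaned_lines.append(kept.strip())

# The line breaks are just the layout of the printed page, so re-join the
# text into one run and collapse the whitespace.
text = re.sub(r"\s+", " ", " ".join(cleaned_lines))
open("garima_clean.txt", "w", encoding="utf-8").write(text)
```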
Wish me luck!
There are other free Amharic OCR websites online, and these seem to do a reasonable job too. But I’ve stuck so far with the Google Docs output.
Incidentally, DeepSeek offered the opinion that the text is not in Amharic, as I had expected, but in Ge'ez, Classical Ethiopic. Luckily it doesn't care.