Uploading to Archive.org

Like most people, I have become used to searching Google Books and Archive.org for out-of-copyright scholarly texts.  These are an enormous blessing to us all: books normally hidden away in university rare books rooms can be downloaded as PDFs.

I’ve become aware that it is possible to upload books to Archive.org, and have uploaded a couple of items that I own and that were not already in the archive.

Of course the first step is to scan the book.  For this I use Abbyy Finereader 8.0, which drives a Plustek Opticbook 3600 scanner at 400 dpi.   This creates images of the pages, and all the pages in the book can be saved as a single PDF file from Finereader.  For optical character recognition, I use Finereader 9.0 (which can only drive the scanner at 300 dpi or 600 dpi, curiously) which has much improved accuracy over Finereader 8.

It is necessary to create an account on Archive.org in order to upload.  Once logged in you get an ‘Upload’ button, which you can use to upload a PDF; this works fine.  To add extra file formats, follow the instructions in the FAQ: edit the item, use the item manager, check out the item (no download is involved in checkout), and then use an FTP interface to add more files.  I was unable to get this to work in Internet Explorer 7 or Firefox 3, but the CuteFTP programme worked fine once I disabled secure FTP and used simple FTP.

I added to each item a text file output, a Word document with all the formatting, and an HTML file with simple formatting only. 

I would like to encourage readers to look at their shelves and consider which texts might be usefully uploaded.  Every item printed before 1st January 1923 is out of copyright in the USA and so can go up.  Copyright law in the EU and UK requires knowledge of the author’s biography, as copyright there absurdly expires 70 years after the death of the author.  But union catalogues of research material such as COPAC these days often give the birth and death dates of authors, making it possible to determine the status.
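
As a rough illustration, the two rules of thumb just given can be written down as a small Python check.  This is only a sketch of what the paragraph above says, not a statement of the law: the function names are made up for the example, and comparing whole calendar years for the 70-year rule is a cautious assumption of the sketch.

```python
from datetime import date
from typing import Optional

US_CUTOFF = date(1923, 1, 1)  # cut-off date quoted in the post
EU_TERM_YEARS = 70            # years after the author's death, per the post


def us_out_of_copyright(printed: date) -> bool:
    """True if the item was printed before the US cut-off date."""
    return printed < US_CUTOFF


def eu_out_of_copyright(death_year: Optional[int], today: date) -> Optional[bool]:
    """Apply the 70-years-after-death rule using whole calendar years.

    Returns None when the death year is unknown, because the status
    cannot be decided without the author's biography.
    """
    if death_year is None:
        return None
    return today.year > death_year + EU_TERM_YEARS
```

An unknown death year deliberately gives None rather than a guess; that is the case where a catalogue such as COPAC has to be consulted.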

Playing with the Google Greek->English translator

Ekaterini Tsalampouni linked to this blog from her Greek language website.  I wanted to know what she said, so I copied it and pasted it into Google language tools.  The result was really very good:

Κατάλογος ψηφιοποιημένων χειρογράφων.

Από το ιστολόγιο του Roger Pearse πληροφορούμαστε για την ύπαρξη στο διαδίκτυο καταλόγου ψηφιοποιημένων χειρογράφων του Μεσαίωνα (μεταξύ των οποίων και αρκετών της Αγίας Γραφής. Για να βρεθείτε στη βάση δεδομένων, πατήστε εδώ. Για να διαβάσετε τη σχετική ανάρτηση του Roger Pearse, πατήστε εδώ.

became

List of digitized manuscripts

From the blog of Roger Pearse information on the existence of online digitized catalog of medieval manuscripts (among them several of the Holy Scripture. To get to the database, click here. To read the suspension of Roger Pearse, click here.

What more could you reasonably want?

How would it deal with patristic Greek, I wondered?  There used to be a website at aegean.gr that had PDFs of Greek texts from the Patrologia Graeca, but it has since vanished.  However I did have a PDF or two, so I grabbed a bit of Constantine Porphyrogenitus, and pasted it in.   Well, from

Κωνσταντίνου ἐν αὐτῷ τῷ Χριστῷ, τῷ αἰωνίῳ βασιλεῖ, βασιλέως, υἱοῦ Λέοντος τοῦ σοφωτάτου καὶ ἀειμνήστου βασιλέως, λόγος, ἡνίκα τὸ τοῦ σοφοῦ Χρυσοστόμου ἱερὸν καὶ ἅγιον σκῆνος ἐκ τῆς ὑπερορίας ἀνακομισθὲν ὥσπερ τις πολύολβος καὶ πολυέραστος ἐναπετέθη θησαυρὸς τῇ βασιλίδι ταύτῃ καὶ ὑπερλάμπρῳ τῶν πόλεων. Εὐλόγησον πάτερ.

you get

Κωνσταντίνου ἐν αὐτῷ τῷ Χριστῷ, τῷ αἰωνίῳ King βασιλέως, son Λέοντος of σοφωτάτου he ἀειμνήστου βασιλέως reason, the Wise ἡνίκα his sacred Chrysostom he scenes from the Holy ὑπερορίας anakomisthen osper the πολύολβος he πολυέραστος ἐναπετέθη treasure τῇ βασιλίδι ταύτῃ he ὑπερλάμπρῳ cities. Πάτερ blessed.

No good, in other words.  But… then I thought, is this to do with accentuation?  What happens if I remove accents?  If I turn Πάτερ into Πατερ?  Sure enough “Πάτερ blessed” became “Blessed father”!

I’m going to experiment a bit further, and see if stripping off the accents does the trick.  What do we need to do, to make this work, I wonder?  Without any accents, we get:

Κωνσταντινου εν αυτω τω Χριστω, τω αιωνιω βασιλει, βασιλεως, υιου Λεοντος του σοφωτατου και αειμνηστου βασιλεως, λογος, ηνικα το του σοφου Χρυσοστομου ιερον και αγιον σκηνος εκ της υπεροριας ανακομισθεν ωσπερ τις πολυολβος και πολυεραστος εναπετεθη θησαυρος τη βασιλιδι ταυτη και υπερλαμπρω των πολεων. Εὐλογησον πατερ.

Which becomes:

Constantine in Christ afto meantime, meanwhile eternal king, king, son of Leon and sofotatou late king, why, inika the Chrysostom of the wise and sacred AGION scenes from the yperorias anakomisthen osper the polyolvos polyerastos enapetethi treasure and the identity and vasilidi yperlampro cities. Blessed father.

Not quite there, is it?  Interestingly, logos comes out as ‘reason’ in its accented form and as ‘why’ in its unaccented form.  What am I doing wrong?
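
For anyone wanting to repeat the experiment, stripping the accents and breathings can be automated.  The following is a minimal sketch in Python using the standard unicodedata module.  The function name is hypothetical, and the sketch assumes the text is unicode polytonic Greek rather than beta-code or a legacy font encoding.

```python
import unicodedata


def strip_greek_diacritics(text: str) -> str:
    """Remove accents, breathings and iota subscripts from Greek text."""
    # NFD splits each precomposed letter into a base letter followed by
    # its combining marks (accent, breathing, iota subscript).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", bare)


print(strip_greek_diacritics("Εὐλόγησον πάτερ."))
```

Passing Πάτερ through this gives Πατερ, which is the form that was translated correctly above.  Note that the iota subscript is dropped along with everything else, so τῷ becomes τω.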

People willing to type up some ancient Greek wanted

Do you have too much money?  If not, you may be interested in this post by Eric at Archaic Christianity.  He’s prepared to pay people to type in some unicode ancient Greek for him.  Might be a quick way to earn a few bucks, if you’re short of cash and have a bit of spare time.

The resulting text will be made available in the public domain, so the effort will benefit everyone.

Computer troubles

Merry Christmas to you all!

It’s clearly not my day, tho.  I came home to find my central heating had broken down.  That’s fixed – amazing to get an engineer on Christmas day! – but my Windows Vista laptop has decided to refuse to boot.  It gets stuck running CHKDSK.

After some effort and running the repair program on the install disk, I have managed to get it to boot; but it’s still whining a bit about this and that, and it all smacks of hard disk corruption.  This means, of course, that I can’t trust it with my data.  At this moment I’m typing this using an old laptop, and trying to do a mass copy from the PC to an external hard disk.  My data matters far more than the PC, although it’s only a few months old.

The reason I burden you with this is that it will probably affect the progress of my various projects. 

What I will need to do is get a new laptop, and make sure the thing has XP on it.  I have never had these problems in my entire career – until I bought a machine running Vista, that is.

Downloadable dictionaries

It would be very helpful to be able to look up words in French within our little translation applications.  But where to find the data?

I was able to find some simple downloadable files, made by Tyler Jones at

http://www.june29.com/IDP/IDPfiles.html

Unfortunately these are quite small.  The French file contains only 3,000 words, the German 8,000, and so on.  And really we need much more detail.
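
Small as they are, files of this kind are easy to load into a lookup table.  The sketch below is in Python and rests on assumptions that should be checked against the actual downloads: that each entry is one line of the form headword, tab, translation, and that lines starting with # are comments.  Which language sits in which column is also an assumption, so a swap option is included.  The function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_glossary(lines: Iterable[str], swap: bool = False) -> Dict[str, List[str]]:
    """Build a lookup table from tab-separated 'headword<TAB>gloss' lines.

    Set swap=True to index on the second column instead of the first.
    """
    glossary: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        headword, tab, gloss = line.partition("\t")
        if not tab:
            continue  # not an entry line in the assumed format
        if swap:
            headword, gloss = gloss, headword
        glossary[headword.strip().lower()].append(gloss.strip())
    return dict(glossary)


def look_up(glossary: Dict[str, List[str]], word: str) -> List[str]:
    """Return every gloss recorded for a word, or an empty list."""
    return glossary.get(word.strip().lower(), [])
```

To use it on a real file, something like `load_glossary(open("French.txt", encoding="utf-8"))` would do, where the file name and the encoding are again guesses.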

Making your own translation tools

I am a profoundly lazy man, in some respects anyway.  I hate pointless labour.  And what can be more pointless than the way many of us translate?

Imagine getting a French text in front of you.  The process goes something like this:

You read the first sentence.  You type an English version into Word.  Then you look back to the book.  A few moments of searching along the line, and you find the second sentence.  You know most of the words, but not all, so you type in a couple of them in an electronic dictionary.  Then you look back again at the page, to get the whole sentence, and spend time again fumbling for it in the mass of text.  Then you write another sentence.  And so on.

Frankly all this switching to and fro is annoying and pointless.

What we need, surely, is to turn the French into an electronic form, split it into sentences, and put each sentence on a separate line.

We could go further.  Machine translators for French are quite good.  Let’s run the electronic text through one of those.  Then split the translation into sentences, and interleave them with the French.

Won’t that be much easier?  We no longer have to find a text in a page in a book; it’s immediately above the line.  We have the machine translator’s vocabulary; that will reduce the amount of looking up.  In short, it’s easier and quicker and less painful.

I’ve written a little utility that does the splitting into sentences and the interleaving.  I use it with a machine translator, and just paste the output back into my utility.

Of course it’s limited in what it does, but the output is a nice Word document with interleaved French and English.
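
The splitting and interleaving step is simple enough to sketch.  The Python below is not the utility described above, only a minimal illustration of the idea under stated assumptions: a sentence is taken to end at a full stop, question mark, exclamation mark or ellipsis followed by whitespace, the output is plain text rather than a Word document, and the function names are made up.

```python
import re
from itertools import zip_longest
from typing import List

# A sentence boundary: whitespace that follows sentence-ending punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences using a deliberately naive rule."""
    return [s.strip() for s in _BOUNDARY.split(text.strip()) if s.strip()]


def interleave(source: str, translation: str) -> str:
    """Put each source sentence directly above its translated sentence."""
    pairs = zip_longest(split_sentences(source),
                        split_sentences(translation),
                        fillvalue="")
    lines: List[str] = []
    for src, tgt in pairs:
        lines.extend([src, tgt, ""])  # blank line between pairs
    return "\n".join(lines).rstrip("\n")


print(interleave("Il pleut. Je reste ici.",
                 "It is raining. I am staying here."))
```

The naive rule will split wrongly after abbreviations such as "M." or "St.", and the pairing drifts if the machine translator merges or divides sentences, so the output still needs a glance before use.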

It’s making working on Agapius much easier anyway!  If only there were some way to hover a mouse over a French word and get a full dictionary entry.  Are there any French dictionaries in XML form?

Greek words in the first millennium

This post at Vitruvian Design is very timely to a man trying to write some Greek->English translation software.  I can’t comment on it from behind this firewall, so will comment here.

I am delighted to see someone else interested in getting a master list of Greek words and morphologies for the first thousand years.  I must look into this project that is referred to.  The problem, surely, will be patristic Greek; and the answer would be to turn G. W. H. Lampe’s Patristic Lexicon into an XML file, in the same way that Perseus have done for Liddell and Scott.  Someone would have to argue with Oxford, who own the copyright; but for non-commercial use, I expect a license could be negotiated.  Lampe is out of print anyway.

I think that I know why Liddell and Scott give weird accusatives as an extra entry.  The book is designed for manual use, and someone finding an odd word is liable to look for it in that form, rather than under the base form, which they do not know.  But such things are unnecessary in a digital file, I agree.

Not all of the files mentioned in the post are known to me.  I know that an XML file of L&S exists in the Perseus Hopper, and also in the Diogenes download.  But I’m not clear where to find the “invaluable list” by Peter Heslin resulting from running the Perseus morphologiser over the TLG disk E.  A morphology file greek.morph.xml is part of the Perseus Hopper download.

The issue of mismatches between this and L&S is quite interesting.  I’d like to follow this more.

But one obvious omission is the New Testament.  The morphology list in MorphGNT is also available; and English meanings in the XML file of Strong’s dictionary.  These too need integrating into the project, I would suggest.

All this work is enormously valuable.  The project is also trying to establish something shockingly fundamental; a list of extant Greek literature!

I’m not sure how I feel about this.  I agree that the task should be undertaken (indeed it’s appallingly hard to find out these things, as I found out when I wanted a list of manuscript traditions), but it seems a digression from the main IT-related task.  They’ve decided to start with poets; again, a minority taste.  I can’t help feeling that this task should be spun off.

The post also introduces me to Epidoc, of which I know little, in the context of converting to and from unicode.  If some way to do this reliably exists, I want it!  More details here.  This is the ‘transcoder’.

All in all, a super post!

Better OCR with Finereader 9

Last night I ran Finereader 9 over a 400-page English translation from 1936 that I had scanned some time ago at 400 dpi.  I then settled down for the onerous task of correcting scanner errors; only to find very few indeed.  There were perhaps a dozen in the whole book!  Probably if I had just exported it to Word and used the spell-checker, I would have found most of them.

I repeated the exercise on another text, with the same result.

FR9 is perceptibly better than FR8 at OCR.  It has some annoyances in the user interface.  Worse, it forces me to use my Plustek Opticbook 3600 at 300 dpi or 600 dpi, when FR8 allowed 400 dpi (the optimal resolution).  But the fact is that there has been a considerable advance here.

When I look back ten years to the misery of “99% accurate” recognition (i.e. 6 errors a page), it is truly amazing.  Recommended.

Fixed width Greek unicode fonts

I’ve been trying to work with the latest version of Jim Tauber’s MorphGNT text file.  For those who don’t know it, it contains all the words in the Greek New Testament, one per line, each identified as noun/verb/plural/whatever, with the word itself as found in the text, plus the dictionary form of the word.  No English meaning; but that can be got from using the dictionary form to look up the meaning in the XML file of Strong’s dictionary.
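
To make the description concrete, here is a minimal Python sketch of reading such a file and looking up a meaning by dictionary form.  The column layout is an assumption, not taken from the file itself: whitespace-separated columns with a verse reference, a part-of-speech code and a parsing code first, the word as found in the text fourth, and the dictionary form last.  The meanings table is a hypothetical stand-in for whatever is extracted from the Strong’s XML, and all the names are invented for the example.

```python
import unicodedata
from typing import Dict, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    reference: str       # e.g. a book/chapter/verse code
    part_of_speech: str
    parsing: str
    word: str            # the word as found in the text
    lemma: str           # the dictionary form


def nfc(text: str) -> str:
    """Normalise to NFC so that identical-looking Greek compares equal."""
    return unicodedata.normalize("NFC", text)


def parse_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per non-blank line in the assumed column layout."""
    for raw in lines:
        fields = nfc(raw).split()
        if not fields:
            continue
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got: {raw!r}")
        # Assumed layout: first three columns are codes, fourth is the
        # word, and the last column is the dictionary form.
        yield Token(fields[0], fields[1], fields[2], fields[3], fields[-1])


def gloss(token: Token, meanings: Dict[str, str]) -> str:
    """Look up the English meaning by dictionary form."""
    return meanings.get(token.lemma, "(no entry)")


# Illustrative lines in the assumed layout, not copied from the real file.
sample = ["010101 N- ----NSF- Βίβλος βίβλος",
          "010101 N- ----GSF- γενέσεως γένεσις"]
meanings = {nfc("βίβλος"): "book"}  # hypothetical stand-in for Strong's
for token in parse_tokens(sample):
    print(token.word, token.part_of_speech, gloss(token, meanings))
```

The normalisation step matters in practice: the same accented letter can be stored in more than one way in unicode, and a lookup by dictionary form fails silently if the two files disagree.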

The Greek used to be present in beta-code, but Jim has now converted to unicode.  That’s fine; except that you now need a font in which to work on it.  Like most text files, you want a fixed-width font.

I suspect Jim does his magic on Linux, where one is available.  But on Windows there is no such free font.  I understand, tho, that the new version of “Courier New” shipped in Vista will do the trick.

I came across this discussion in a typographic forum, where a Microsoft font-person lurks.  It lists some of the possible commercial fonts you could use.
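
For readers who have not met beta-code, mentioned above as the older form of the file, the conversion to unicode is mostly a table lookup.  The following Python is a minimal sketch covering only the basic letters, capitals marked with an asterisk, and the common diacritics.  It is not the converter used for MorphGNT, the function name is made up, and it ignores the many special codes of full beta-code.

```python
import unicodedata

_LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ", "h": "η",
    "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν", "c": "ξ",
    "o": "ο", "p": "π", "r": "ρ", "s": "σ", "t": "τ", "u": "υ", "f": "φ",
    "x": "χ", "y": "ψ", "w": "ω",
}
_DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta: str) -> str:
    """Convert a simple beta-code string to precomposed unicode Greek."""
    text = beta.lower()
    out = []
    pending = []     # diacritics written between '*' and a capital letter
    capital = False
    for i, ch in enumerate(text):
        if ch == "*":
            capital = True
        elif ch in _DIACRITICS:
            (pending if capital else out).append(_DIACRITICS[ch])
        elif ch in _LETTERS:
            letter = _LETTERS[ch]
            if capital:
                letter = letter.upper()
            elif ch == "s" and text[i + 1:i + 2] not in _LETTERS:
                letter = "ς"  # sigma at the end of a word
            out.append(letter)
            out.extend(pending)
            pending.clear()
            capital = False
        else:
            out.append(ch)  # spaces, punctuation, anything unrecognised
    # Combine each letter with its marks into a single character.
    return unicodedata.normalize("NFC", "".join(out))


print(beta_to_unicode("*)en a)rxh=| h)=n o( lo/gos"))
```

The design choice worth noting is that the letters and marks are emitted separately and only combined at the end by NFC normalisation, which avoids having to list every accented letter by hand.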

7,300+ visitors to Tertullian.org last month

I was interested to discover from this site that apparently more than 7,300 unique individuals used my site last month.  For a site dedicated to a subject as abstruse as the Fathers, that’s not bad going.  Perhaps we underestimate interest in early Christian history?
