OCR – Page 2 – Roger Pearse

Like most people, I have become used to searching Google Books and Archive.org for out-of-copyright scholarly texts. These are an enormous blessing to us all: books normally hidden in university rare books rooms can be downloaded as PDFs.

I’ve become aware that it is possible to upload books to Archive.org, and have uploaded a couple of items which I own and which were not in the archive.

Of course the first step is to scan the book. For this I use Abbyy Finereader 8.0, which drives a Plustek Opticbook 3600 scanner at 400 dpi. This creates images of the pages, and all the pages in the book can be saved as a single PDF file from Finereader. For optical character recognition I use Finereader 9.0, which is much more accurate than Finereader 8 (although, curiously, it can only drive the scanner at 300 dpi or 600 dpi).

It is necessary to create an account on Archive.org in order to upload. You then get an ‘Upload’ button, which you can use to upload a PDF. This works fine. To add extra file formats, follow the instructions in the FAQ: edit the item, use the item manager, check out the item (no download is involved in checkout), and then use an FTP interface to add more files. I was unable to get this to work in Internet Explorer 7 or Firefox 3, but the CuteFTP programme worked fine once I disabled secure FTP and used simple FTP.

I added to each item a text file output, a Word document with all the formatting, and an HTML file with simple formatting only.

I would like to encourage readers to look at their shelves and consider which texts might usefully be uploaded. Every item printed before 1st January 1923 is out of copyright in the USA and so can go up. Copyright laws in the EU and UK require knowledge of the biography of the author, as copyright there absurdly expires 70 years after the death of the author. But union catalogues of research material such as COPAC these days often give the birth and death dates of authors, making it possible to determine the status.
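
The two rules above can be put into a rough sketch. The following Python is only an illustration. It takes the cut-off dates as stated in this post (printed before 1st January 1923 for the USA; 70 years after the author's death for the EU and UK) and assumes, for simplicity, that the EU/UK term runs to the end of the seventieth calendar year after death. The function names are hypothetical, and complications such as joint authors or anonymous works are ignored.

```python
from datetime import date

US_CUTOFF = date(1923, 1, 1)   # cut-off date as stated in the post
LIFE_PLUS_YEARS = 70           # EU/UK term after the author's death


def free_in_usa(printed: date) -> bool:
    """True if the item was printed before the US cut-off date."""
    return printed < US_CUTOFF


def free_in_eu_uk(author_death_year: int, today: date) -> bool:
    """True once 70 full calendar years have passed since the author's death.

    Assumption: the term ends on 31 December of the 70th year after death.
    """
    return today.year > author_death_year + LIFE_PLUS_YEARS


if __name__ == "__main__":
    # Hypothetical example: a book printed in 1910 by an author who died in 1945.
    print(free_in_usa(date(1910, 6, 1)))
    print(free_in_eu_uk(1945, date.today()))
```

The death year is the piece of information that a catalogue entry would supply; the printing date comes from the title page.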

Last night I ran Finereader 9 over a 400-page English translation from 1936 that I had scanned some time ago at 400 dpi. I then settled down for the onerous task of correcting scanner errors; only to find very few indeed. There were perhaps a dozen in the whole book! Probably if I had just exported it to Word and used the spell-checker, I would have found most of them.
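
The spell-checker idea can be sketched in a few lines of Python. This is only an illustration, not what Word or Finereader actually does. It assumes the OCR output is plain text and that a set of known words is available (the tiny `KNOWN_WORDS` set here is a hypothetical stand-in for a real dictionary), and it simply flags every word not in that set.

```python
import re

# Hypothetical stand-in for a real dictionary word list.
KNOWN_WORDS = {"the", "book", "of", "saints", "was", "written", "in", "latin"}


def suspect_words(text, known_words):
    """Return the words in text that are not in known_words (case-insensitive)."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in known_words]


if __name__ == "__main__":
    sample = "The book of saints was writtcn in Latin"
    # "writtcn" is a typical OCR slip ("c" read for "e") and should be flagged.
    print(suspect_words(sample, KNOWN_WORDS))
```

A check like this catches misread letters that produce non-words, but not errors that happen to produce another real word, which is why some manual reading is still needed.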

I repeated the exercise on another text, with the same result.

FR9 is perceptibly better than FR8 at OCR. It has some annoyances in the user interface. Worse, it forces me to use my Plustek Opticbook 3600 at 300 dpi or 600 dpi, when FR8 allowed 400 dpi (the optimal resolution). But the fact is that there has been a considerable advance here.

When I look back ten years to the misery of “99% accurate” recognition (i.e. 6 errors a page), it is truly amazing. Recommended.
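
The arithmetic behind that comparison is simple. As a rough sketch in Python, assume that accuracy is counted per word and that a page holds about 600 words. That page size is an assumption, chosen because it reproduces the figure of 6 errors a page; the post does not give one. The new figure uses the dozen errors in the 400-page book mentioned above.

```python
WORDS_PER_PAGE = 600  # assumed page size, not stated in the post


def errors_per_page(accuracy, words_per_page):
    """Expected number of wrong words on a page at the given word accuracy."""
    return (1.0 - accuracy) * words_per_page


# Ten years ago: "99% accurate" recognition.
old_rate = errors_per_page(0.99, WORDS_PER_PAGE)

# Now: about a dozen errors in a 400-page book.
new_rate = 12 / 400

# Word accuracy implied by the new error rate, under the same assumption.
implied_accuracy = 1.0 - new_rate / WORDS_PER_PAGE

if __name__ == "__main__":
    print(f"old: {old_rate:.2f} errors a page")
    print(f"new: {new_rate:.2f} errors a page")
    print(f"implied accuracy: {implied_accuracy:.4%}")
```

On these assumptions the error rate has fallen from roughly six a page to roughly one every thirty pages.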

Roger Pearse

Tag: OCR

Uploading to Archive.org

Better OCR with Finereader 9
