Problems with Abbyy Finereader 11

Tonight I realised that I was getting close to the end of one section of Ibn Abi Usaibia, and that the next 350 pages were in sight.  I thought that it might be a good idea to create a Finereader project for those pages, run the optical character recognition on them, and do a few global search-and-replaces.

So far I have been working with Finereader 10, although I did a small experiment with Finereader 11 when I got it.  But this new chunk is an obvious break-point to move up.

I started up Finereader 11, and attempted to import my settings — primarily my custom English-Arabic language setting — from FR10.  This promptly crashed.

I restarted FR11, and after a bit of fiddling recreated the language and saved it.  I then opened the PDF with the 350 pages, which was fine.

Then I OCR’d the lot.  This seemed to go OK, but then it started popping up horrible-looking internal error messages.  In fact it just would not allow me to view the “read” pages.

I ended up going back to FR10, which is running at the moment.  Doubtless I have done something wrong, but it is troublingly easy to crash FR11.

From my diary

I’ve spent part of this afternoon working on proofing a fresh chunk of the translation of Ibn Abi Usaibia, History of Physicians.  I’m beginning to find that I need to make the same global find-and-replace changes in each chunk: Abu needs to become Abū, Air to become Alī, Ibrāhfm to become Ibrāhīm, and so on.  Unfortunately Finereader does not give me any macro facility; I have to hit Ctrl-H and go through as much of the list as I can remember.

What is needed, obviously, is a script.  Or else a macro facility, or some kind of automation.  It needs to operate by recording what I do, and then be editable.
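A minimal sketch of such a script, in Python.  It does not record anything, and it assumes that the recognised text has first been saved out of Finereader as plain Unicode text, which is not how the work is done above.  The function name is made up for illustration, and the list holds only the examples already mentioned; it is meant to be edited as new recurring errors turn up.

```python
import re

# Corrections to apply, as (wrong, right) pairs.  Only the examples from
# the post are listed here; extend the list as needed.
REPLACEMENTS = [
    ("Abu", "Abū"),
    ("Air", "Alī"),
    ("Ibrāhfm", "Ibrāhīm"),
]


def apply_replacements(text, replacements=REPLACEMENTS):
    """Replace each wrong form with its correction, whole words only."""
    for wrong, right in replacements:
        # \b keeps "Abu" from matching inside a longer word such as
        # "Abulcasis".  Matching is case-sensitive, so "air" is untouched.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text


if __name__ == "__main__":
    sample = "Abu Air ibn Ibrāhfm met Abulcasis."
    print(apply_replacements(sample))
```

The corrected text would then have to be pasted or imported back, so this is only useful if working outside Finereader for that step is acceptable.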

Anyone got any suggestions?  I have tried AutoIt and AutoHotKey, and neither has the recorder facility.  AutoHotKey claims to have it, but it is not in fact installed.

Converting DjVu into PDF

The volumes of the GCS at the Kaiser Wilhelm Library in Posen in Poland are in .DjVu format, which is rather inconvenient.  So today I have been looking at whether it is possible to convert them to PDF.  I’ve had some success, I must say.

I obtained a copy of IrfanView from the web.  You need the basic .exe download, but also the plugins, because one of these makes it possible to work with .djvu files.

Once I had installed this, I opened the index.djvu for one of the GCS volumes.  This in fact opened all the files, as it does in the DjVu reader.  I then followed the instructions here:

1) With “IrFanView” go to “File->Print” or ‘Ctrl+p’

2) On the window select “Printer: Adobe PDF”, hit “Printer setup” for the paper size you want, etc…., in the middle of that window it says “Print size”: select “Best fit to page (aspect ratio)”

3) On the right side of that window you will see the Preview, under preview is “Multiple images” select “Print all pages”

4) When you’re finished hit “Print” and is going to ask you the name of file you want to save it.

And that’s it!! after severals minutes (i think hours, depending on how many images the DJVU file has) you’re going to have a PDF file with the info you want!

But it doesn’t take hours.  However I did run into a glitch: I got this error:

%%[ ProductName: Distiller ]%%
%%[Page: 1]%%
%%[Page: 2]%%
%%[Page: 3]%%
%%[Page: 4]%%
%%[Page: 5]%%
%%[Page: 6]%%
%%[Page: 7]%%
%%[Page: 8]%%
%%[Page: 9]%%
%%[Page: 10]%%
%%[Page: 11]%%
%%[Page: 12]%%
%%[Page: 13]%%
%%[Page: 14]%%
%%[Page: 15]%%
%%[Page: 16]%%
%%[Page: 17]%%
%%[ Error: invalidfileaccess; OffendingCommand: showpage ]%%
%%[ Flushing: rest of job (to end-of-file) will be ignored ]%%
%%[ Warning: PostScript error. No PDF file produced. ] %%

A bit of hunting around revealed an answer:

… the issue appears to be with my Kaspersky Anti-Virus software. By setting a check mark against most of the exclusions in the Kaspersky application control for Acrobat Distiller everything now seems to be working OK.

I.e. in Settings … Application Control … Applications … (long pause when you hit that button!) … ADOBE SYSTEMS, then right-click on Acrobat Distiller, Application Rules … Exclusions, and check everything except “Do not scan network traffic”.

This worked, and IrfanView ran through 500+ pages and created a perfectly good PDF, some 500MB in size.

The only downside is that I ended up with a white margin on the right and bottom, where the image was padded out to A4 (or whatever).  Nothing I could do would change that.  Probably I just haven’t got the settings right.
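A different route, not the one used above, is the `ddjvu` command-line tool from the DjVuLibre package, which can write a PDF directly without going through a printer driver.  Whether that also avoids the padded margin is untested here.  The sketch below assumes DjVuLibre is installed and `ddjvu` is on the PATH; the function names and file names are hypothetical.

```python
import subprocess


def ddjvu_pdf_command(djvu_path, pdf_path):
    """Build the ddjvu command line that converts a DjVu file to PDF."""
    return ["ddjvu", "-format=pdf", djvu_path, pdf_path]


def convert_djvu_to_pdf(djvu_path, pdf_path):
    """Run ddjvu; raises CalledProcessError if the conversion fails."""
    subprocess.run(ddjvu_pdf_command(djvu_path, pdf_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.  This just prints
    # the command; call convert_djvu_to_pdf() to actually run it.
    print(" ".join(ddjvu_pdf_command("index.djvu", "volume.pdf")))
```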

Nuance Omnipage 18

This morning I got hold of Nuance Omnipage 18 standard edition.  The box was very light: mostly air, a CDROM, and a cheeky bit of cheaply printed paper announcing that they included no manuals at all, in order to save the planet.  Humph.

The footprint is quite small, and I copied the CDROM to my hard disk before installation.  Curiously the disk packet had two numbers both labelled as “serial number”.

The installation was unfamiliar.  As I always do, I clicked on the “select options” and found that it wanted to install some voice-related stuff.  I unchecked that.  Then I went ahead and did the install.  At one point it announced that it was going to install something called “CloudConnector”, without giving me the chance to decline.  But I hit cancel, and the rest of the install went fine.  It then popped up a box asking me to register — this opened a web page with a rather shoddy page collecting details.  Every page gave an “invalid certificate” error in IE, which is sloppy.  And then it asked if I wanted to activate, which I did.  So far, so good.

I then opened OP.  It popped up some “friendly” menu, which I removed.  Then I looked at the main screen, and decided to open a PDF and work on it in OP.  It took a little while to work out that I needed “Process … Workflows … PDF or Scanned Image to Omnipage document”.  Somehow I think “File … Open” would be rather more normal!  Once you’ve selected this, you click on a button on the tool bar to start processing.  It prompted for a PDF, so I gave it one which I had created myself from some digital photos of Ibn Abi Usaibia, and it promptly objected “non-supported image size” to each page and refused to open it!  Silly programme: I don’t care what the image size is, I want to get some OCR of the pages!

OK, let’s see if I can work around this.  I select instead “Camera image to Omnipage document” and select a bunch of the same images, as they were before I put them in a PDF.  This time it decides to cooperate.  It reads the images and rotates them to portrait mode (correctly).  Then it pops up some kind of dictionary thing, which is annoying.  I hit “close” and the Windows cursor starts spinning.  It doesn’t seem to be doing anything; it’s just sitting there.  Hum.

After a while I get bored, and close the program down.  At least it dies gracefully, prompting me to save my work.  I reopen it, and reopen my project.  Then I click the “Text editor” tab.  It looks as if it recognised page 1 OK, despite its being typescript.  No errors, anyway.  My first encounter with the OCR quality is good.

But … I can only see EITHER the image, or the recognised text, not both at the same time.  Hum.  It ought to be possible to do this.  After a bit of hunting, I find “Window … Classic view” which gives me side-by-side.  But I go back to “flexible view”, because I have just discovered that, if I click on the text window, the line of text from the image appears in a hover box above the line.

Now this is really rather convenient.  Mind you, when the lines are slanted — as is often the case — I wonder how it would do?

I hit Alt-Down, and nothing happens.  Of course, this is not Finereader.  A bit of hunting and the Edit menu informs me that Ctrl-PgDn is next page.  F4 is next suspect character.  I never used this in Finereader, but here using it with the hover box really works.  My text here has quite a few vowels with overscores.  None of these are recognised by default, but at least I can see them!

So far, not too bad!  Better, indeed, than I had feared.

Now I need to start adding custom characters.  I want to define my own “language” for recognition, based on English but with all the funny characters that I need in this document to represent long vowels.  “Tools … Options” seems to give me choices.  On the process tab I see a box saying “Open PDF as images”.  It’s unchecked by default; I’ll check it now, and see if I can open that PDF.  Looks as if you have to save your settings; I save mine to the same directory where I stored the install CDROM.  Then I do “File … New”, and … still can’t open my PDF.  Oh well.

Back to the OPD project from the digital images.  Can I define some extra characters?  Well, you can; but it all looks rather weedy compared to Finereader’s options.  Let’s try these: āīōūšŠ.  I get them from charmap, pointing at the Alphabetum Unicode font; but any reasonably full Unicode font such as Arial Unicode MS or Titus Cyberbit Basic would do.  Then “Tools… Options … OCR … Additional characters” and I just paste them into the box.  The “…” button next to that box leads to some weedy, underspecified lookup, which really needs to be more like Charmap.  But do these characters get picked up?

Now I want to re-recognise.  I click on the thumbnail for page 1 and … the menu gives me no option.  Hum.  Wonder what to do. 

In fact I’ve spent some time now trying to work out how to kick off a limited re-read.  No luck yet.  Surely this should be simple and obvious?  Eventually I work out that you select the thumbnails of the pages you want, and hit the toolbar button and that kicks it off.

So how does it do?  Well, it recognises the overscore a.  None of the other characters are picked up.  That’s not so good as Finereader. 

Also the more skewed the page is, the less well OP handles it (understandable), and the less easy it is to fix.  OP rather presumes that the recognition is near perfect, and that there is only limited fixing to do.  In such a situation, indeed, OP will be quicker to do a job than Finereader.  And I notice that a ribbon with characters to paste is across the top of the text window, which is a nice touch.  This motivates me to go back and explore again.  I haven’t worked out how to set MY characters in that ribbon.  But when I went into the weedy charmap substitute, there was a similar ribbon at the top, and right-clicking on it allowed you to add more character sets, which increased the number of characters; and by clicking on them, to add them to the ribbon.  How you remove them from the ribbon I don’t know.  It is, in truth, a badly designed feature.  And the OCR still doesn’t recognise what I need.

I’ve had enough for now and closed it down.  Is it any good?  Almost certainly.  It’s less good for weird characters.  But it undoubtedly will see service.

UPDATE: Have just discovered, on starting Word 2010, that Nuance have seen fit to mess with the menus in this (without asking me).  Drat them!

First impressions of Abbyy Finereader 11

Finereader 11 looks quite a lot like Finereader 10.  So far, it seems very similar.  One nice touch is that when it is reading a page, a vertical bar travels down the thumbnail.

But I have already found an oddity.  I imported into it the project that I am currently working on in Finereader 10 — part of Ibn Abi Usaibia — and it looks really weird!  All the recognised text is spaced out vertically!  The paragraph style is “bar code”, and no other styles are available. 

Here’s what I see when I open it:

Opening a Finereader 10 project in Finereader 11

Not very useful, is it?  But when I minimise the image, and increase the recognised panel to 100% size, it looks like this!

Finereader 11 – zoomed version of recognised text

There seems to be no rhyme or reason for the massive gaps between lines.  And here is the very same project in Finereader 10:

Finereader 10 image of same document

Weird.  Doubtless there is some setting to persuade FR11 to behave, but it isn’t obvious what.  This does NOT happen when I recognise the page again in FR11.  The style gets set to “Body Text (2)”, in this case. 

And … when I do Ctrl-Z, and revert the recognition, it goes back to the weird appearance above.  But … this time, a bunch of other styles are available, and if I change to BodyText2, that is what I get.  But on the next page … once again, Barcode is the only style. 

This must be a bug, I think.  It means that Abbyy’s testers have not tested importing documents from FR10 sufficiently.  What it means is that you can’t upgrade projects once you start them.  Well … I try to keep my projects small, and break up large documents into small chunks, so I shan’t mind.  That would seem to be the workaround.

One good feature that is new, is that it remembers where you were in the document last time.  All previous versions always opened the document at page 1.  I got quite accustomed, indeed, to placing a “qqq” at the point where I stopped, so I could find it again next time.  No need in FR11, it seems.

Also FR11 comes bundled with “PDF Transformer 3”.  This suggests that the latter product was bought in, to beef up the rather unremarkable PDF handling in Finereader.  I’ve not tried this yet, tho.

OCR: Omnipage and Finereader

Scanning and OCR are on my mind at the moment.  A new version of Abbyy Finereader, version 11, is out.  Since I have some 750 pages of Ibn Abi Usaibia to do, any improvement in accuracy is welcome, however slight.

Originally I did my OCR using Omnipage.  It is many years since I was led (by Susan Rhoads of Elfinspell.com) to look at Finereader 5.  This was immensely superior, and I have never used any other product since.  But I see that Omnipage 18 is now out.  Stirred by a bit of curiosity, I’ve been wondering what this would be like.

Finereader is not without its faults.  Foremost among them, for what I want to do, is that it cannot make a PDF searchable without making the PDF much, much larger, messing with the images, and so forth.  This is so bad, in fact, that I use Adobe Acrobat Pro 9 for that task, despite the much inferior OCR.

Omnipage seems to be aware of the issue, and a look at their site suggests that they realise that a lot of this activity goes on.

I decided, therefore, to buy both and see what they’re like.  I will let you know!

But … software vendors are thieves and robbers!  If you go to the Abbyy site, the cost of a downloaded upgrade to Finereader Pro 11 is “€ 89 / £ 65 (download)”.  The full version is “€ 129 / £ 99” — and if you want just the download, it’s exactly the same price, despite the fact that it costs them less!  But go to Amazon.co.uk, the complete boxed set is just £63.16 — less than the upgrade.  Needless to say, that’s what I ordered.

Omnipage are no better.  Go to the Nuance site, and Omnipage 18 (standard version) is £79.99, whether download or boxed.  Again they swindle the download users.   But go to Amazon.co.uk, and the complete boxed set is £46.90!

I didn’t buy the Omnipage Pro version, but stuck with the standard one.  It’s a lot more money, and I wasn’t convinced that I’d use the extra features — especially since I don’t know if the OCR is any good at all.  Here a trial version would have helped — Finereader make trial versions available online.  This is smart marketing on their part, because magazine reviews of such a specialised area of software are invariably useless.

My current interest in Russian texts of Methodius means that I was interested to see that Omnipage offer a separate Russian version.  Finereader used to have a specific “Cyrillic option” version — indeed I owned a copy, back in the FR5 days — but this seems to have vanished from their product list.  Kudos to Finereader: Russian support is included in the main product!  I only wish their obscure “fraktur” recognition module was included too!  This recognises old “Gothic”-style typefaces, and some of us would find it handy.  But I could only find it in their SDK for Linux.  And it doesn’t seem that you can even buy the latter off-the-shelf.

Tertullian.org may go down over the next day or two briefly

I’m transferring the domain name from Network Solutions — who are a pain to deal with — to PairNIC.  Unfortunately the latter won’t let me enter the domain name servers until the transfer actually happens.  Tomorrow is Sunday, when I do not use my computer or the web.  So it is possible that I will miss the emails.  All the rest of my domains should work fine.  My apologies for this.

From my diary

Oh bother … the cough I have been struggling with for the last week or so, and the sensitive stomach that I have lived with for nearly three weeks, have ganged up now with a streaming cold that came on last night.  It must be holiday time!  This business of living in an organic construct is not that great an idea, sometimes.  Everyone in our office is starting to cough and choke, so I imagine we will all get it.  It will stop me doing much this weekend, I suspect.

Last night was productive, tho.  I realised that I had only 8GB left of the 500GB on my PC.  Where had it gone, I wondered?

I always use WinDirStat to work out which directories are hogging the space.  In this case, I found that one working directory for an OCR task had taken some vast area of disk, and I moved it out to my two external backup hard disks.  Finereader 10 is really a disk hog! 

Another 40GB (!) was being occupied by two Internet Explorer temporary log files, named brndlog.txt and brndlog.bak.  I also took the time to reorganise a bit, as I found multiple copies of some large PDFs.  After an hour or so I had 89GB spare.  I also backed everything up to the two backup drives.
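For anyone without WinDirStat to hand, the same question (which directories are hogging the space?) can be roughly answered with a few lines of Python.  This is only a sketch of the idea, not a replacement for the tool; the function names are invented for illustration, and unreadable files are simply skipped.

```python
import os


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirectories(path, count=10):
    """Return up to count (size, name) pairs for the biggest subdirectories."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                entries.append((directory_size(entry.path), entry.name))
    return sorted(entries, reverse=True)[:count]
```

Calling `largest_subdirectories` on a drive root or a working directory gives a list to print, largest first.  It will be slow on a full disk, since every file is visited.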

Very pleased with myself after that!

OCR with macrons and other funny letters in Finereader

I’m scanning Brockelmann’s Geschichte der arabischen Litteratur.  It’s mostly in German, of course; but the Arabic is transliterated using a wide variety of odd Unicode characters.  There is the letter “a” with a macron over it (a horizontal line), “sh” written as “s” with a little hat on it, and so forth.  These don’t occur in modern German, so they get weeded out.

But you can do this, in Finereader.  You just define a new language, based on German.  I called mine “German with Arabic”.  And when you do, you specify which unicode characters the language contains.  So all I had to do was scroll down through the unicode characters, find the funnies that Brockelmann had used, and add them in.

And, if you don’t get them all first time, you can edit the language, select it, get the properties, and add the next few in.  And … it works.  It really does.
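Finding the funnies by scrolling through the Unicode tables is the slow part.  If a hand-corrected sample of the text is available as Unicode text, a short Python script can list the non-ASCII letters it contains, with their code points and official names, as a checklist of what to add to the language.  This is a sketch under that assumption; the function names are made up, and the ordinary German umlauts and ß will show up in the list too.

```python
import unicodedata
from collections import Counter


def unusual_characters(text):
    """Count the non-ASCII letters in text, most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127 and ch.isalpha())
    return counts.most_common()


def describe(ch):
    """Return a line such as 'U+0101 ā LATIN SMALL LETTER A WITH MACRON'."""
    return "U+%04X %s %s" % (ord(ch), ch, unicodedata.name(ch, "UNKNOWN"))


if __name__ == "__main__":
    sample = "Abū Alī, Ibrāhīm, šaih"
    for ch, count in unusual_characters(sample):
        print(count, describe(ch))
```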

Finereader is really amazing OCR software.  And I learned all this from the help file.  Look under “alphabet” in the search.
