I’ve been converting a load of Greek text into unicode using Greek transcoder with much success. But I ran across a glitch. Depending on the option chosen, the accents can all end up to one side!
The option responsible is this one, “Use composing characters”.
I checked that, and I should not have done; it caused off-centre accents. But what on earth does it mean?
A hunt around the web reveals that you can do all those accents in one of two ways. Firstly you can use a character that includes them all inside the character. Alternatively you just type ‘alpha’ followed by the accents, and the browser and editor should render them all correctly as one letter with some accents on the top. The former is called “using precomposed characters”; the latter “using composing characters”. The latter does not work very well, as applications don’t support it. This TLG PDF says more.
I’ve also noticed that on XP the WINWORD executable tends to hang around in memory after you exit a document. If you copy the .dot files for Greek Transcoder into the Application Data\Microsoft\Word\Startup directory, they are only picked up when the WINWORD executable starts, so get ignored in this case. I’ve had to manually terminate it to get the utility to appear.
Once it’s loaded, the buttons appear in Word.
Often some character sets have to have additional control characters to indicate how a diacritic or punctuation character is supposed to appear in the text flow. This is really common with Hebrew and Arabic with weakly and neutrally typed punctuation characters. A control character helps tell some formatting processors precisely where a diacritic or punctuation character faces in the flow. If you’re converting to Word, I wouldn’t bother. If you’re creating XML files, depending on the formatter you use, you might need the control characters. (I’m sure the latter comment is a whole different kind of Greek than which you’re accustomed.)
Thanks for your note. I’m working in Word, myself.
On my last job I became familiar with the problem of unicode in XML, and the necessity to encode the entities. I think you’d probably be still best off to use the code for the precomposed characters.
In Greek, absolutely. Most formatters know how to handle left to right languages. When you start playing with Farsi, Arabic, and Hebrew, you have to tell the formatter how parenthesis and quotation marks are supposed to be displayed. Of course, that’s probably less of an issue with ancient texts and more necessary for modern texts (which is what I deal with).
Quotation marks are a problem for editors of ancient texts, tho, because ancient texts did not have them. I’ve just been reading a set of guidelines (in French) for handling these issues, and they end up using French quotes (chevrons), double-quotes and single quotes. Anything for clarity!
Right to left in XML sounds dicey. I’ve never tried. Do you get right-to-left text?
RTL processing just depends on the stylesheets and the formatter you use. I support AntennaHouse XSL-FO formatter for my clients, and it does RTL very well. The advantage is that we can run the same XML file through with different settings and automate the formatting in different page sizes and styles. For editing, you have to have an XML editor and fonts that support RTL. Arbortext Editor does well with just about any language (including real problem charsets like Thai and Khmer).