How does “AI translation” work? Some high-level thoughts

The computer world is a high-bullshit industry.  Every computer system consists of nothing more than silicon chips running streams of ones (1) and zeros (0), however grandly this may be dressed up.  The unwary blindly accept and repeat the words and pictures offered by salesmen with something to sell.  These are then repeated by journalists who need something to write about.  Indeed the IT industry is the victim of repeated fads.  These are always hugely oversold; they come, reach a crescendo, and then wither away.  But anybody doing serious work needs to understand what is going on under the hood.  If you cannot express it in your own words, you don’t understand it, and you will make bad decisions.

“AI” is the latest nonsense term being pumped by the media.  “Are the machines going to take over?!” scream the journalists.  “Your system needs AI,” murmur the salesmen.  It’s all bunk, marketing fluff for the less majestic-sounding “large language models (LLMs) with a chatbot on the front.”

This area is the preserve of computer science people, who are often a bit strange, and are always rather mathematical.  But it would seem useful to share my current understanding as to what is going on, culled from a number of articles online.   I guarantee none of this; this is just what I have read.

Ever since Google Translate appeared, machine translation has been done by having a large volume of texts in, say, Latin, a similarly large volume in English, and a large amount of human-written translations of Latin into English.  The “translator” takes a Latin sentence input by a human, searches for a text containing those words in the mass of Latin texts, looks up the existing English translation of the same text, and spits back the corresponding English sentence.  Of course they don’t just have sentences; they have words, and clauses, all indexed in the same way.  There is much more to this, particularly in how material from one language is mapped to material in the other, but that’s the basic principle.  This phrase-lookup approach was known as – jargon alert – “Statistical Machine Translation” (SMT).  Around 2016 Google replaced it with a neural-network version, “Neural Machine Translation” (NMT), which learns the mappings rather than storing them directly, but which still depends on the same masses of existing human translations.
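The lookup principle can be sketched in a few lines of Python.  Everything here is invented for illustration – a real system indexes millions of aligned phrases, not four entries in a dictionary:

```python
# A toy illustration of translation-by-lookup.  The "parallel corpus"
# below is invented for the example.
parallel_corpus = {
    "amor vincit omnia": "love conquers all",
    "carpe diem": "seize the day",
    "omnia": "all things",
    "amor": "love",
}

def translate(latin: str) -> str:
    """Look the whole phrase up; failing that, translate word by word."""
    latin = latin.lower().strip()
    if latin in parallel_corpus:
        return parallel_corpus[latin]
    # Fall back to per-word lookup, keeping unknown words as-is.
    return " ".join(parallel_corpus.get(w, w) for w in latin.split())

print(translate("amor vincit omnia"))  # whole-phrase match: "love conquers all"
print(translate("amor omnia"))         # assembled word by word: "love all things"
```

Nothing here “understands” Latin; it is retrieval from a store of existing human translations.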

This process, using existing translations, is why the English translations produced by Google Translate would sometimes drop into Jacobean English for a sentence, or part of it.

The “AI translation” done using an LLM is a further step along the same road, but with added bullshit at each stage.  The jargon word for this technology seems to be “Generative AI”.

A “large language model” (LLM) is a file.  You can download them from sites like Hugging Face or GitHub.  It is a file containing numbers, one after another.  Groups of these numbers represent each word, or part of a word.  The numbers are not random either – they are carefully crafted and generated to tell you how that word fits into the language.  Words relating to similar subjects have numbers which are “closer together”.  So in the sentence “John went skiing in the snow,” both “snow” and “skiing” relate to the same subject, and will have numbers closer together than the numbers for “John.”
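In reality each word gets a whole list of numbers – a “vector” – and “closer together” means the lists point in similar directions.  A toy sketch, with three-number vectors made up for the purpose (real models use hundreds or thousands of numbers per word, learned from text, not written by hand):

```python
import math

# Made-up 3-dimensional "embeddings" for three words.
vectors = {
    "snow":   [0.9, 0.8, 0.1],
    "skiing": [0.8, 0.9, 0.2],
    "John":   [0.1, 0.2, 0.9],
}

def closeness(a, b):
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(closeness(vectors["snow"], vectors["skiing"]))  # high
print(closeness(vectors["snow"], vectors["John"]))    # much lower
```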

Again you need a very large amount of text, in both languages.  For each language, these texts are then processed into this mass of numbers.  The numbers tell you whether the word is a verb or a noun, or is a name, or is often found with these words, or never found with those.  The mass of numbers is a “language model”, because it contains vast amounts of information about how the language actually works.  The same English word may have more than one number; “right” in “that’s right” is a different concept to the one in “the politicians of the right.”  The more text you have, the more you can analyse, and the better your model of the language will be.  How many sentences contain both “ski” and “snow”?  And so on.  The model of how words, sentences, and so on are actually used, in real language texts, becomes better the more data you put in.  The analysis of the texts starts with human-written code that generates connections; but as you continue to process the data, the process will generate yet more connections.
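The “how many sentences contain both words” idea is just counting, and can be sketched like this – the three-sentence “corpus” is invented for the example:

```python
from collections import Counter
from itertools import combinations

# An invented mini-corpus; real training data runs to trillions of words.
sentences = [
    "john went skiing in the snow",
    "the snow fell on the ski slope",
    "john read a book",
]

# Count, for every pair of words, how many sentences contain both.
pair_counts = Counter()
for s in sentences:
    words = set(s.split())
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("skiing", "snow")])  # 1 -- they co-occur
print(pair_counts[("book", "snow")])    # 0 -- they never do
```

Scaled up over billions of sentences, tallies like these are where the “closeness” of the numbers comes from.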

The end result is these models, which describe the use of the language.  You also end up with a mass of data connecting the two together.  The same number in one side of the language pair will also appear in the other model, pointing to the equivalent word or concept.  So 11050 may mean “love” in English but “am-” in Latin.

As before, there are a lot of steps to this process, which I have jumped over.  Nor is it just a matter of individual words; far from it.

The term used by the AI salesmen for this process is “training the model.”  They use this word to mislead, because it gives the reader the false impression of a man being trained.  I prefer to say “populating” the model, because it’s just storing numbers in a file.

When we enter a piece of Latin text in an AI translator, this is encoded in the same way.  The AI system works out the appropriate number for each token – word or part-word – in our text.  All this number-crunching takes time, which is why AI systems hesitate on-screen.  The resulting stream of encoded numbers is then fed into the LLM, which sends back the corresponding English text for those numbers, or numbers which are mathematically “similar”.  Plus a lot of tweaking, no doubt.
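The encoding step can be sketched like this.  The vocabulary and its numbers are invented for the example; real tokenisers also split rare words into sub-word pieces, but the principle – text in, numbers out – is the same:

```python
# A toy tokeniser with an invented vocabulary.  "<unk>" stands in for
# any word the vocabulary does not know.
vocab = {"amor": 11050, "vincit": 2041, "omnia": 87, "<unk>": 0}

def encode(text: str) -> list[int]:
    """Turn a sentence into the stream of numbers the model consumes."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(encode("Amor vincit omnia"))  # [11050, 2041, 87]
print(encode("amor ignotum"))       # [11050, 0] -- unknown word becomes 0
```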

But here’s the interesting bit.  The piece of Latin that we put in, and the analysis of it, is not discarded.  This is more raw data for the model.  The provider logs it, and can feed it into the next round of “training”.

This has two interesting consequences.

The first consequence is that running the same piece of text through the chatbot twice will often give different results, and not necessarily better ones.  Partly this is because the system deliberately adds a little randomness when choosing each next word – the jargon term is “temperature” – so that the output does not feel canned; and partly because the service behind the website is itself constantly being revised.
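One reason the output varies is that the chatbot picks each next word by weighted random choice rather than always taking the most likely one.  A toy sketch, with invented probabilities:

```python
import random

# A model's real output is a probability for every possible next word.
# These numbers are invented; the point is that the chatbot *samples*
# from them rather than always taking the top one.
next_word_probs = {"snow": 0.6, "rain": 0.3, "sunshine": 0.1}

def pick_next_word(rng: random.Random) -> str:
    """Choose the next word at random, weighted by its probability."""
    words = list(next_word_probs)
    weights = [next_word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Different random states can pick different words for the same input.
print(pick_next_word(random.Random(1)))
print(pick_next_word(random.Random(4)))
```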

The second consequence is even more interesting: you can poison a model by feeding it malicious data, designed to make it give wrong results.  It’s all data, at the end of the day.  The model is just a file.  It doesn’t know anything.  All it is doing is generating the next word, dumbly.  And what happens if the input is itself AI-generated, but is wrong?
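Both points – the dumb next-word generation, and the poisoning – can be shown with a toy model that does nothing but count which word follows which.  The sentences are invented for the example:

```python
from collections import defaultdict, Counter

def build_model(sentences):
    """Count, for each word, which words follow it -- nothing cleverer."""
    counts = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def next_word(model, word):
    """'Generate' by taking the most common follower.  No understanding."""
    return model[word].most_common(1)[0][0]

corpus = ["the snow is cold", "the snow is cold", "the snow is deep"]
model = build_model(corpus)
print(next_word(model, "is"))  # "cold" -- the majority answer in the data

# "Poisoning": flood the data with malicious text, and the same dumb
# counting now produces the attacker's answer instead.
poisoned = build_model(corpus + ["snow is warm"] * 10)
print(next_word(poisoned, "is"))  # "warm"
```

Real poisoning attacks are subtler, but the principle is the same: the model is only a tally of its input data.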

In order to create a model of the language and how it is used, you need as much data as possible.  Long ago Google digitised a vast number of the world’s books, and turned them into searchable text, even though 80% of them are in copyright.  Google Books merely gives a window on this database.

AI providers need lots of data.  One reason why they have tried to conceal what they are doing is that the data input is nearly all in copyright.  One incautious AI provider did list the sources for its data in an article, and these included a massive pirate archive of books.  But they had to get their data from somewhere.  Similarly this is why there are free tiers on all the AI websites – they want your input.

So… there is no magic.  There is no sinister machine intelligence sitting there.  There is a file full of numbers, and processes.

The output is not perfect.  Even Google Translate could do some odd things.  But AI Translate can produce random results – “hallucinations”.


Translations of St Nicholas of Myra material on this website

I’ve just created a page on this blog with links to every post that contains a translation of one or the other of the medieval texts containing St Nicholas material.  It’s here.

Looking back, I started taking an interest in 2013.  The first translations of the legends appeared in 2015.  The most recent was earlier today.

That’s a long, long time.  And how things have changed.  Back in 2015, I was commissioning translations from the Greek of various short pieces.  In 2020, Google Translate suddenly became usable, at least for Latin.  And this year, we have the new AI Translators.  It’s possible to do stuff, even if you don’t have much knowledge of the languages.  It’s rather marvellous really!


Methodius ad Theodorum (BHG 1352y) Part 4 – A Draft Translation using AI

Sometimes the only way forward is to plunge in, and see what happens.  So I have taken the modern Greek translation of Methodius ad Theodorum by Ch. Stergioulis, and machine-translated it into English.  The results are attached, together with Stergioulis’ original, which has the ancient Greek facing the modern Greek, and footnotes at the end.

The ancient Greek original is preserved in a single Vatican manuscript, which gave the editor, G. Anrich, a lot of problems – so much so that he printed it in volume 1 of his Hagios Nikolaos, and then printed a transcription of the manuscript in vol. 2, with corrections.  Some of his footnotes in vol. 1 betray bafflement; and so do some of Stergioulis’ footnotes!  Words otherwise unknown, I gather.  It’s clear that the text is corrupt.

I scanned Stergioulis’ translation into a Word document (attached).  Then for each chapter I did the following:

  1. Run the Greek through Google Translate and paste it into a Word document.
  2. Send a request to Bard AI, “please translate this from modern Greek into English, with notes”, followed by the Greek.  The notes were usually useless, but sometimes not.
  3. Send a request to ChatGPT 3.5, “please translate this from modern Greek into English” (this wouldn’t give notes), followed by the Greek.
  4. Manually modify the Google Translate output in the Word document using the output from the two AI websites.  Where all three agreed, this was no problem, and it was just a case of choosing the most pleasant version.  Where one disagreed, I looked a bit harder at it.  Where all three disagreed, I started looking up some key words in Greek dictionaries, and looked at the footnotes.  There was in fact only one sentence, in chapter 10, where the meaning of a clause – “και φοβούμενος το λιχνιστήρι34 αυτής της ανθρωπαρέσκειας για την απόκτηση αγαθών” – was completely unclear, and I got there in the end.

This was usually straightforward, not least because the chapters were small.

For a couple of chapters Bard AI threw a wobbly.  Instead of outputting a translation, it started to spew a message in Greek, basically saying “I am an AI model, I don’t know how to do that.”  It seems that it was ignoring the English part of my request string, thought that I was writing in Greek, and so it treated the Greek passage as itself a request to do something, not as something to translate.  I found that replying with “Could you translate that request from modern Greek into English?” did the trick, and caused it to translate instead.

Anyway, here is the output from this process.

The translation is readable enough.  I’m not sure how accurate it is – any comments, anyone?

UPDATE: final version here.
