Converting old HTML from ANSI to UTF-8 Unicode

This is a technical post, of interest to website authors who are programmers.  Read on at your peril!

The Tertullian Project website dates back to 1997, when I decided to create a few pages about Tertullian for the nascent world-wide web.  In those days unicode was hardly thought of.  If you needed to be able to include accented characters, like àéü and so forth, you had to do so using “ANSI code pages”.  You may believe that you used “plain text”; but it is not very likely.

If you have elderly HTML pages, they are most likely using ANSI.  This causes phenomenal problems if you try to use Linux command line tools like grep and sed to make global changes.  You need to convert them to Unicode first, before trying anything like that.

What was ANSI anyway?

But let’s have a history lesson.  What are we dealing with here?

In a text file, each byte is a single character.  The byte is in fact a number, from 0 to 255.  Our computers display each value as text on-screen.  In fact you don’t need 256 characters for the symbols that appear on a normal American English typewriter or keyboard.  All of these fit into the first 128 values, 0-127.  To see which value “means” which character, look up the ASCII table.

The values from 128-255 are not defined in the ASCII table.  Different nations, and even different companies, used them for different things.  On an IBM PC some of these “extended ASCII codes” were used to draw boxes on screen!

The different sets of values were unhelpfully known as “code pages”.  So “code page” 437 was the original IBM PC set: ASCII plus those box-drawing characters.  “Code page” 1252 was “Western European”, and included just such accents as we need.  You can still see these “code pages” in a Windows console – just type “chcp” and it will tell you what the current code page is; “chcp 1252” will change it to 1252.  In fact Windows used 1252 fairly commonly, and that is likely to be the encoding used in your ANSI text files.  Note that nothing whatever in the file tells you what encoding the author used.  You just have to know (but see below).
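
If you have Python to hand, it will tell you something similar: locale.getpreferredencoding() reports the ANSI code page that programs on your machine default to.  Purely an illustration, nothing more:

import locale

# Report the default "ANSI" encoding for the current Windows locale;
# on a western-European installation this typically prints 'cp1252'.
print(locale.getpreferredencoding())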

So in an ANSI file, the “ü” character will be a single byte.

Then unicode came along.  The unicode encoding that prevailed was UTF-8, because, for the values 0-127, it is identical to ASCII.  So we will ignore the other formats.

In a UTF-8 file, letters like the “ü” character are coded as TWO bytes (some need three or four).  This allows well over a million different characters to be encoded.  Most modern text files use UTF-8.  End of the history lesson.
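
You can see the difference for yourself with a couple of lines of Python – purely an illustration, not part of the conversion:

# The same character, encoded both ways; repr() shows the raw bytes.
text = u"ü"
print(repr(text.encode("cp1252")))   # one byte in ANSI / code page 1252
print(repr(text.encode("utf-8")))    # two bytes in UTF-8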

What encoding are my HTML files using?

So how do you know what the encoding is?  Curiously enough, the best way to find out on a Windows box is to download and use the Notepad++ editor.  This simply displays the encoding at the bottom right.  There is also a menu option, “Encoding”, which will list all the possibilities, and … drumroll … allow you to change the encoding at a click.
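
If you would rather check from a script than by eye, a rough heuristic is to try decoding the bytes as UTF-8, and fall back to cp1252 if that fails.  A minimal sketch, assuming the only realistic candidates are UTF-8 and Windows-1252 (which was true for this site):

# Sketch: guess whether a file is UTF-8 or ANSI (cp1252).
# A file that decodes cleanly as UTF-8 is almost certainly UTF-8.
def guess_encoding(path):
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"

# Example, using a file name that appears later in this post
print(guess_encoding("works/de_pudicitia.htm"))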

As I remarked earlier, the Linux command line tools like grep and sed simply won’t be serviceable.  The trouble is that these things are written by Americans who don’t really believe anywhere else exists.  Many of them don’t support unicode, even.  I was quite unable to find any that understood ANSI.  I found one tool, ugrep, which could locate the ANSI characters; but it did not understand code pages so could not display them!  After two days of futile pain, I concluded that you can’t even hope to use these until you get away from ANSI.

My attempts to do so produced webpages that displayed with lots of invalid characters!

How to convert multiple ANSI HTML files to UTF-8

There is a way to efficiently convert your masses of ANSI files to UTF-8, and I owe my knowledge of it to this StackExchange article here.  You do it in Notepad++: you write a script that drives the editor and just does it.  It runs very fast, it is very simple, and it works.

You install the “Python Script” plugin into Notepad++, which allows you to run a python script inside the editor.  Then you create a script using Plugins | Python Script | New script.  Save it to the default directory – otherwise it won’t show up in the list when you need to run it.

Mine looked like this:

import os
import re
# The base directory containing the files to convert
filePathSrc = "d:\\roger\\website\\tertullian.old.wip"

# Get all the fully qualified file names under that directory
for root, dirs, files in os.walk(filePathSrc):

    # Loop over the files
    for fn in files:
    
      # Check the file extension (case-insensitive, so .HTM and .HTML are caught too)
      if fn.lower().endswith(('.html', '.htm')):
      
        # Open the file in notepad++
        notepad.open(root + "\\" + fn)
        
        # Comfort message
        console.write(root + "\\" + fn + "\r\n")
        
        # Use menu commands to convert to UTF-8
        notepad.runMenuCommand("Encoding", "Convert to UTF-8")
        
        # Do search and replace on strings
        # Charset
        editor.replace("charset=windows-1252", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=iso-8859-1", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=us-ascii", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=unicode", "charset=utf-8", re.IGNORECASE)
        editor.replace("http://www.tertullian", "https://www.tertullian", re.IGNORECASE)

        # Save and close the file in Notepad++
        notepad.save()
        notepad.close()

Python uses indentation with spaces, instead of curly brackets, to mark its blocks, so the indentation is crucial.

Also turn on the console: Plugins | Python Script | Show Console.

Then run it with Plugins | Python Script | Scripts | your-script-name.

Of course you run it on a *copy* of your folder…

Then open some of the files in your browser and see what they look like.

And now … now … you can use the Linux command line tools if you like.  Because you’re using UTF-8 files, not ANSI, and, if they support unicode, they will find your characters.

Good luck!

Update: Further thoughts on encoding

I’ve been looking at the output.  Interestingly, this does not always work.  I’ve found files converted to UTF-8 where the text has become corrupt.  Doing the conversion manually in Notepad++ works fine.  I am not sure why this happens.

I’ve always felt that using non-ASCII characters is risky.  It’s better to convert the unicode characters into HTML entities: using &uuml; rather than ü.  I’ve written a further script to do this, in much the same way as above.  The changes need to be case sensitive, of course.

I’ve now started to run a script in the base directory to add DOCTYPE and charset="utf-8" to all files that do not have them.  It’s unclear how to do the “if” test using Notepad++ and Python, so instead I have used a Bash script running in Git Bash, adapted from one sent in by a correspondent.  Here it is, in abbreviated form (a rough Python sketch of the same test follows after it):

# This section
# 1) adds a DOCTYPE declaration to all .htm files
# 2) adds a charset meta tag to all .htm files before the title tag.

# Read all the file names using a find and store in an array
files=()
find . -name "*.htm" -print0 >tmpfile
while IFS= read -r -d $'\0'; do
      #echo $REPLY - the default variable from the read
      files+=("$REPLY")
done <tmpfile
rm -f tmpfile

# Loop over the files
for file in "${files[@]}"; do

    # Add DOCTYPE if not present (case-insensitive, so <HTML> is handled too)
    if ! grep -qi "<!DOCTYPE" "$file"; then
        echo "$file - add doctype"
        sed -i 's|<html>|<!DOCTYPE html>\n<html>|I' "$file"
    fi

    # Add charset if not present (case-insensitive)
    if ! grep -qi "meta charset" "$file"; then
        echo "$file - add charset"
        sed -i 's|<title>|<meta charset="utf-8" />\n<title>|I' "$file"
    fi

done
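
For what it is worth, the same “if” test can also be written in ordinary Python, outside Notepad++.  This is only a sketch (the Bash version above is what was actually used), and it assumes the files are already UTF-8; as ever, run it against a copy:

import io
import os
import re

# Sketch: add a DOCTYPE and a charset meta tag to .htm files that lack them.
# The directory is the same placeholder used in the Notepad++ script above.
base = "d:\\roger\\website\\tertullian.old.wip"

for root, dirs, files in os.walk(base):
    for fn in files:
        if not fn.lower().endswith(".htm"):
            continue
        path = os.path.join(root, fn)
        with io.open(path, "r", encoding="utf-8") as f:
            text = f.read()

        changed = False

        # Add DOCTYPE if not present
        if "<!DOCTYPE" not in text.upper():
            text = re.sub("<html>", "<!DOCTYPE html>\n<html>", text, count=1, flags=re.IGNORECASE)
            changed = True

        # Add charset if not present
        if "meta charset" not in text.lower():
            text = re.sub("<title>", '<meta charset="utf-8" />\n<title>', text, count=1, flags=re.IGNORECASE)
            changed = True

        if changed:
            with io.open(path, "w", encoding="utf-8") as f:
                f.write(text)
            print(path)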

Find non-ASCII characters in all the files

Once you have converted to unicode, you then need to convert the non-ASCII characters into HTML entities.  This I chose to do on Windows in Git Bash.  You can find the duff characters in a file using this:

 grep --color='auto' -P -R '[^\x00-\x7F]' works/de_pudicitia.htm

This prints every line in the file that contains a character outside the ASCII range, with the offending character highlighted.

Of course this is one file.  To get a list of all htm files with characters outside the ASCII range, use this incantation in the base directory, and it will walk the directories (-R) and only show the file names (-l):

grep --color='auto' -P -R -n -l '[^\x00-\x7F]' | grep htm

Convert the non-ASCII characters into HTML entities

I used a python script in Notepad++, and this complete list of HTML entities.  So I had line after line of

editor.replace('Ë','&Euml;')
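
If typing out the whole table seems laborious, the same replacements can in principle be generated from Python’s own entity list.  A sketch only, assuming the Python Script plugin is running Python 2.7 (where the module is called htmlentitydefs; in Python 3 the table is html.entities.codepoint2name) and that editor.replace will accept UTF-8 byte strings:

# Sketch: generate the replacements from Python's built-in entity table
# instead of writing them out line by line.
import htmlentitydefs

for codepoint, name in sorted(htmlentitydefs.codepoint2name.items()):
    if codepoint > 127:                            # leave plain ASCII characters alone
        char = unichr(codepoint).encode("utf-8")   # the raw character, as UTF-8 bytes
        editor.replace(char, "&" + name + ";")     # plain replace is case-sensitive, as required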

I shall add more notes here.  They may help me next time.
