Unicode Macrons, mainly for Te Reo Māori

Tēnā koutou! Ko Tīpene tāku ingoa.

In the above sentence you should see four macronised characters. In Te Reo Māori, macrons denote long vowel sounds, which can have quite a dramatic effect on meaning: for example, keke vs kēkē (cake vs armpit).

When considering your web-enabled information superhighway content delivery system you need to first slam on the brakes and consider how you want to store your content and then deliver it. Let me show you how I do so.

The Requirements

Before designing a solution, we need to know the problem. In what formats do we dish this information up? HTML? PDF? Email? RSS feed? iCalendar? vCard? XML-based format? CSV file? Excel spreadsheet? Plain text?

A very simple way of storing information is to have the server store HTML. This works well when your only outputs are HTML webpages, RSS feeds, and HTML formatted email. If you're only going to output HTML, then you simply escape all incoming data into HTML entities, store HTML in your database and in your files, and send HTML back to the user. Very simple workflow. Very easy to implement. Congratulations, you can skip the rest of this page (although you may still find it useful).

However, it may end up being better and easier to store Unicode (UTF-8) and then dish up HTML webpages as HTML and dish up additional formats with no extra conversion cost. MySQL has good character set support, so we may as well use this method.

Data flow

It pays to first consider your data flow. You should already know how people move around your website. Say they submit a form on Page A, which sends POST data to Page B, and then they get redirected to Page C to see the new item.

[Diagram: flow between web pages, from Page A to Page B to Page C]

Behind all this there is the software on the server receiving the post data, and the database. When the browser submits the form to Page B, script P (as in POST) receives the data. It does things to the data, and then stores it in the database DB. Later, script G (as in GET) retrieves the data from the database and sends it to the browser, where it is displayed as Page C.

[Diagram: complete flow, showing pages, scripts, and the database]

You need to understand this concept. Your scripts P and G are receiving data from, and sending data to, the user agent. They are also communicating with the database DB. Are they receiving nonsense? Is what they send understandable?

Each box potentially can have a different type of content in it. For example, all the pages are going to be HTML. What you put in your script is up to you. When you write to the database, you must escape any potential SQL injection in the input.

Consider this:

[Diagram: complete flow, this time showing content type and escaping]

Page B submits UTF-8 data to script P. When script P receives the data from Page B it converts the incoming data from UTF-8 to ISO-8859-1 with HTML entities in place of macrons. Before writing the data to the database, script P converts it to SQL escaped ISO-8859-1 with HTML entities. From there the entire flow is ISO-8859-1 with HTML entities.

This is workable. However, when you need to put the data into a format that does not understand HTML entities, you run into strife.

A better idea for a workflow would be to use UTF-8. With UTF-8 we are not losing any data. Remember, some of these changes can be lossy. For example, ISO-8859-1 does not allow for macrons. It is best to store your data with the highest quality, and then export at a lower quality when required.
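One way to see which conversions are lossy is to push a string down into a target character set and back up again, then check whether it survived. The following is only a minimal sketch using PHP's iconv(); the function name survives_round_trip() is made up for this example.

```php
<?php
// Hypothetical helper: does $utf8 survive a trip through $charset and back?
function survives_round_trip($utf8, $charset)
{
    // Errors are silenced because some iconv builds raise a notice and
    // return false when a character has no equivalent in the target set.
    $down = @iconv('UTF-8', $charset, $utf8);
    if ($down === false) {
        return false;
    }
    $up = @iconv($charset, 'UTF-8', $down);
    return $up === $utf8;
}

$greeting = "T\xC4\x93n\xC4\x81 koe"; // "Tēnā koe" as UTF-8 bytes

var_dump(survives_round_trip($greeting, 'ISO-8859-1')); // macrons do not fit
var_dump(survives_round_trip($greeting, 'UTF-16LE'));   // all of Unicode fits
```

Comparing the round-tripped string with the original, rather than trusting the return value of a single conversion, keeps the check independent of how a particular iconv build reports characters it cannot represent.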

Also consider efficiency. Conversion between formats takes time: not much time, but still time. For every time you write to a database, you will probably read the record fifteen or more times, depending on your application and traffic. (I could have made these figures up, because I have another database that gets written to about once a year, and people read from it daily, so I can easily have the record being read a thousand times for every write.)

You want to do your conversions as few times as possible. This may require a little bit of research and sensible design to figure out where you can most efficiently do the conversions. You may even wish to store multiple versions of the data, rather than converting on the fly. (This is the speed / storage space tradeoff.)

I have used orange boxes to show where a conversion is taking place. They are:

  1. Between Page B and Script P
  2. Between Script P and the Database
  3. Between Script G and Page C (optional)

Also, do not do the same conversion in more than one place, because you can end up with double escaped characters, and those look really ridiculous. More on this later.

Cr&#257;zy!
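The double escaping is easy to reproduce. This is a small sketch, assuming the text was already stored with a numeric HTML entity in place of the macron and is then escaped again on the way out; the variable names are only for illustration.

```php
<?php
// A string that has already been escaped once: the macron is stored
// as a numeric HTML entity rather than as a real character.
$stored = 'Cr&#257;zy!';

// Escaping it a second time turns the "&" of the entity into "&amp;",
// so a browser shows the entity text itself instead of the macron.
$twice = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $twice, "\n";

// Telling htmlspecialchars() to leave existing entities alone hides the
// problem, but doing each conversion in exactly one place is the real fix.
$once = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8', false);
echo $once, "\n";
```

The first echo prints Cr&amp;#257;zy! and the second prints the stored string unchanged.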

[Diagram: complete flow, showing UTF-8 throughout]

Alternatively you could use UTF-8 throughout your application, and avoid escaping (except to remove sneaky HTML or SQL injection attacks, which goes without saying).

The above diagram shows a flow that consists entirely of UTF-8. The only point at which anything gets escaped is the construction of the SQL.

Now, something to think about before designing your data flow is what each of these funny names means, and what is appropriate where.

An introduction to Unicode and other character sets

This is a very brief overview. You can find better resources at www.unicode.org.

I will start this introduction to Unicode by introducing you to another character set you may be more familiar with, ISO 8859-1.

Latin1 and ISO-8859-1

Latin1 is really just another name for ISO-8859-1, although some software uses the name loosely: MySQL's latin1, for instance, is actually the very nearly identical Windows-1252. These character sets happen to be the default for a lot of software.

There are no macrons. The closest approximation is the umlaut (two dots above the character), or you could use "academic macrons", which means doubling the vowel. However, doubling can be ambiguous, because doubled vowels cannot reliably be shrunk back into macrons. For example, the whaka- prefix on a word can put two of the same vowel side by side without there being a long vowel: whakaata

A character is equal to a byte.

Here is a hex dump showing, on the first line, what happens when you simply ignore the macrons and, on the second line, what it looks like when you double the characters.

5465 6e61 206b 6f65                      Tena koe
5465 656e 6161 206b 6f65                 Teenaa koe

Here is a table showing the different macronised characters of interest, their Unicode code point, and an ISO-8859-1 approximation:

Character                       Unicode   ISO-8859-1 approximation
CAPITAL LETTER A WITH MACRON    U+0100    AA
SMALL LETTER A WITH MACRON      U+0101    aa
CAPITAL LETTER E WITH MACRON    U+0112    EE
SMALL LETTER E WITH MACRON      U+0113    ee
CAPITAL LETTER I WITH MACRON    U+012A    II
SMALL LETTER I WITH MACRON      U+012B    ii
CAPITAL LETTER O WITH MACRON    U+014C    OO
SMALL LETTER O WITH MACRON      U+014D    oo
CAPITAL LETTER U WITH MACRON    U+016A    UU
SMALL LETTER U WITH MACRON      U+016B    uu

Table of HTML entities

One easy way to add macrons is to use the HTML entities for them. The general format is:

&#DDD;

Where DDD is the decimal value of the character. Unicode code points (U+HHHH) are in hex.
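Because code points are written in hex and entities in decimal, it is easy to slip up when converting by hand. Here is a rough sketch that does the conversion in PHP; the helper name to_numeric_entities() is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with its decimal numeric entity, &#DDD;.
function to_numeric_entities($utf8)
{
    if ($utf8 === '') {
        return '';
    }
    // UTF-32BE gives exactly four bytes per code point, which unpack()
    // can read as big-endian 32-bit integers.
    $codePoints = unpack('N*', iconv('UTF-8', 'UTF-32BE', $utf8));
    $out = '';
    foreach ($codePoints as $cp) {
        $out .= ($cp < 128) ? chr($cp) : '&#' . $cp . ';';
    }
    return $out;
}

echo hexdec('0113'), "\n";                                 // U+0113 as a decimal number
echo to_numeric_entities("T\xC4\x93n\xC4\x81 koe"), "\n";  // "Tēnā koe" with entities
```

The first line prints 275, the decimal value of U+0113, and the second prints T&#275;n&#257; koe.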

Here is a table showing the different macronised characters of interest, their Unicode code point, an ISO-8859-1 approximation, and the matching HTML entity:

Character                       Unicode   ISO-8859-1 approximation   HTML entity
CAPITAL LETTER A WITH MACRON    U+0100    AA                         &#256;
SMALL LETTER A WITH MACRON      U+0101    aa                         &#257;
CAPITAL LETTER E WITH MACRON    U+0112    EE                         &#274;
SMALL LETTER E WITH MACRON      U+0113    ee                         &#275;
CAPITAL LETTER I WITH MACRON    U+012A    II                         &#298;
SMALL LETTER I WITH MACRON      U+012B    ii                         &#299;
CAPITAL LETTER O WITH MACRON    U+014C    OO                         &#332;
SMALL LETTER O WITH MACRON      U+014D    oo                         &#333;
CAPITAL LETTER U WITH MACRON    U+016A    UU                         &#362;
SMALL LETTER U WITH MACRON      U+016B    uu                         &#363;

Here is a hex dump showing a piece of ISO-8859-1 text with HTML entities for macrons.

5426 2332 3735 3b6e 2623 3235 373b 206b  T&#275;n&#257; k
6f65                                     oe

As you can see this is a lot less efficient than just using the plain characters! Or is it?

ISO-8859-13

After a little research (actually just reading James Gasson and Pablo Saratxaga's i18n localisations) I discovered that ISO-8859-13 actually includes macrons!

This extract from the ISO-8859-13 to Unicode mapping file shows in the first column the ISO-8859-13 code in hex, and in the second column the matching Unicode values.

0xC2    0x0100  #       LATIN CAPITAL LETTER A WITH MACRON
0xC7    0x0112  #       LATIN CAPITAL LETTER E WITH MACRON
0xCE    0x012A  #       LATIN CAPITAL LETTER I WITH MACRON
0xD4    0x014C  #       LATIN CAPITAL LETTER O WITH MACRON
0xDB    0x016A  #       LATIN CAPITAL LETTER U WITH MACRON
0xE2    0x0101  #       LATIN SMALL LETTER A WITH MACRON
0xE7    0x0113  #       LATIN SMALL LETTER E WITH MACRON
0xEE    0x012B  #       LATIN SMALL LETTER I WITH MACRON
0xF4    0x014D  #       LATIN SMALL LETTER O WITH MACRON
0xFB    0x016B  #       LATIN SMALL LETTER U WITH MACRON

With ISO-8859-13 it is possible to include macrons while only using one byte per character. Note this hex dump.

54e7 6ee2 206b 6f65 0a                   T.n. koe.

Proper macrons within eight bits.
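The same bytes can be produced with iconv(). This is just a sketch, assuming your iconv build knows the ISO-8859-13 character set (most do); the variable names are for illustration only.

```php
<?php
$utf8 = "T\xC4\x93n\xC4\x81 koe";   // "Tēnā koe" as UTF-8 bytes

// Convert down to ISO-8859-13, where each macronised vowel is one byte.
$iso13 = iconv('UTF-8', 'ISO-8859-13', $utf8);

echo bin2hex($iso13), "\n";   // hex of the converted bytes
echo strlen($iso13), "\n";    // byte count: one byte per character
```

The hex matches the dump above, apart from the trailing newline (0a) that the dump picked up, and the length is eight bytes for eight characters.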

UTF-16

UTF-16 is a great idea where we use two bytes to represent most characters. Even those characters that only need one byte, we pad them out to two. For those characters that need two bytes (eg, these macronised ones) we use two bytes. Characters that do not fit into sixteen bits have to be written as a pair of two-byte units (a surrogate pair), so the "two bytes per character" promise does not always hold. Also, there is the issue of which of the two bytes is the big end, and so you need to shovel in an extra two bytes (the byte order mark) at the start so the recipient can figure out which end is which.

Here is a hex dump showing our very familiar piece of text in UTF-16 with correct macrons. Note each character takes two bytes, and the byte order mark FF FE (hex) is at the start of the string, which marks the text as little endian.

fffe 5400 1301 6e00 0101 2000 6b00 6f00  ..T...n... .k.o.
6500                                     e.

Nice to see macrons are supported, but that looks like a damn mess!

You may have seen stuff like this if you've opened a Microsoft Word document in a plain text editor. There are null bytes all over the place. Microsoft Word (in some versions, between 1998 and whenever) stored the document with a mixture of UTF-16 big endian, UTF-16 little endian, and Windows-1252.
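Reading UTF-16 back means looking at the byte order mark first. Here is a minimal sketch of that step; utf16_to_utf8() is a made-up name, and falling back to big endian when there is no byte order mark is an assumption of this example rather than a rule you can rely on for all data.

```php
<?php
// Hypothetical helper: decode UTF-16 bytes to UTF-8, using the byte
// order mark (BOM) at the start to work out which end is which.
function utf16_to_utf8($bytes)
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFF\xFE") {
        return iconv('UTF-16LE', 'UTF-8', substr($bytes, 2)); // little endian
    }
    if ($bom === "\xFE\xFF") {
        return iconv('UTF-16BE', 'UTF-8', substr($bytes, 2)); // big endian
    }
    // No BOM: assume big endian. This is an assumption of the sketch;
    // real data may need a different default.
    return iconv('UTF-16BE', 'UTF-8', $bytes);
}

// The bytes from the UTF-16 hex dump above.
$dump = hex2bin('fffe540013016e00010120006b006f006500');
echo utf16_to_utf8($dump), "\n";
```

Stripping the byte order mark before converting keeps it from turning up as a stray invisible character at the start of the UTF-8 output.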

UTF-7

This is some sort of highfalutin way of jamming Unicode through email systems, which traditionally only like seven bits to the byte.

Here is a hex dump showing our lovely greeting in UTF-7 with correct macrons.

542b 4152 4d2d 6e2b 4151 4520 6b6f 65    T+ARM-n+AQE koe

Well, that was an education.
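For the curious, the +ARM- in that dump is not magic: it is the UTF-16 big endian bytes of the character, base64 encoded without padding and wrapped in a plus and a minus. The sketch below handles one character at a time; utf7_shift() is a made-up name and this is nowhere near a complete UTF-7 encoder.

```php
<?php
// Hypothetical helper: encode one non-ASCII character the way UTF-7 does.
// Take its UTF-16BE bytes, base64 them without padding, wrap in "+" and "-".
function utf7_shift($utf8Char)
{
    $utf16 = iconv('UTF-8', 'UTF-16BE', $utf8Char);
    return '+' . rtrim(base64_encode($utf16), '=') . '-';
}

echo utf7_shift("\xC4\x93"), "\n"; // small e with macron
echo utf7_shift("\xC4\x81"), "\n"; // small a with macron
```

This prints +ARM- and +AQE-. In the hex dump the second one appears as +AQE with no closing minus, because the minus may be left off when the next character (here a space) could not be mistaken for base64.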

UTF-8

Now to get to the most interesting one!

UTF-8 looks just like ASCII (and so like unaccented ISO-8859-1 text) until you put a macron in your text. Then, it very obligingly expands to gobble as many bytes as necessary to fit in that character.

The bytes C4 (hex) and C5 (hex) are big hints that a macron is heading your way. Macrons manage to fit into just two bytes, and the first byte is either C4 or C5.

Here's the table, now expanded and showing the two bytes that a macron takes up when encoded in UTF-8.

Character                       Unicode   UTF-8 (hex)
CAPITAL LETTER A WITH MACRON    U+0100    C4 80
SMALL LETTER A WITH MACRON      U+0101    C4 81
CAPITAL LETTER E WITH MACRON    U+0112    C4 92
SMALL LETTER E WITH MACRON      U+0113    C4 93
CAPITAL LETTER I WITH MACRON    U+012A    C4 AA
SMALL LETTER I WITH MACRON      U+012B    C4 AB
CAPITAL LETTER O WITH MACRON    U+014C    C5 8C
SMALL LETTER O WITH MACRON      U+014D    C5 8D
CAPITAL LETTER U WITH MACRON    U+016A    C5 AA
SMALL LETTER U WITH MACRON      U+016B    C5 AB

Shall we have a look at the hex dump?

54c4 936e c481 206b 6f65                 T..n.. koe

Notice two bytes for macronised characters and one byte for normal characters? I have attempted to point out the C4 93 and C4 81 in the hex dump.
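Where do C4 and C5 come from? For code points between U+0080 and U+07FF, UTF-8 puts the top five bits into a lead byte and the low six bits into a trailing byte. Here is a small sketch of that arithmetic; utf8_two_byte() is a hypothetical helper written for this page, not a PHP built-in.

```php
<?php
// Hypothetical helper: encode a code point in the range U+0080..U+07FF
// as two UTF-8 bytes. All ten macronised vowels fall in this range.
function utf8_two_byte($codePoint)
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('code point is outside the two-byte range');
    }
    $lead  = 0xC0 | ($codePoint >> 6);    // 110xxxxx: top five bits
    $trail = 0x80 | ($codePoint & 0x3F);  // 10xxxxxx: low six bits
    return chr($lead) . chr($trail);
}

echo bin2hex(utf8_two_byte(0x0113)), "\n"; // small e with macron
echo bin2hex(utf8_two_byte(0x016B)), "\n"; // small u with macron
```

This prints c493 and c5ab, matching the table. Code points U+0100 to U+013F all get a lead byte of C4, and U+0140 to U+017F all get C5, which is why those two bytes are such reliable hints.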

Comparisons of all of the above representations

ISO-8859-1 / Latin1 (with no macrons)

5465 6e61 206b 6f65                      Tena koe

ISO-8859-1 / Latin1 (with academic macrons)

5465 656e 6161 206b 6f65                 Teenaa koe

ISO-8859-1 / Latin1 (with HTML entities)

5426 2332 3735 3b6e 2623 3235 373b 206b  T&#275;n&#257; k
6f65                                     oe

ISO-8859-13

54e7 6ee2 206b 6f65 0a                   T.n. koe.

UTF-16

fffe 5400 1301 6e00 0101 2000 6b00 6f00  ..T...n... .k.o.
6500                                     e.

UTF-7

542b 4152 4d2d 6e2b 4151 4520 6b6f 65    T+ARM-n+AQE koe

UTF-8

54c4 936e c481 206b 6f65                 T..n.. koe

iconv: character set conversion

Now we have had a look at some character sets, we need to consider what we will do with our text:

If you're going to be throwing text around from one character set to another, then check out libiconv, which is very good for this sort of thing.

You can hack up your own support, for example, seeking out and mangling any characters you don't recognise, but th?t co?ld c??se pro?lems.

You could even try a hybrid approach, mangling some by hand, and then sending the rest off to iconv for dealing with.

A good idea would be to simply have UTF-8 everywhere, and all incoming data gets converted up to UTF-8.
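A sketch of what "convert everything up to UTF-8" might look like in PHP follows. The helper name to_utf8() is made up, and the character set to assume for text that is not already UTF-8 is a guess your application has to make for itself.

```php
<?php
// Hypothetical helper: make sure incoming text is UTF-8. $assumed is the
// character set we guess the sender used when the bytes are not valid
// UTF-8; that guess is an assumption your application has to make.
function to_utf8($bytes, $assumed = 'ISO-8859-1')
{
    // The /u modifier makes PCRE reject subjects that are not valid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;                       // already UTF-8, leave it alone
    }
    return iconv($assumed, 'UTF-8', $bytes); // convert up to UTF-8
}

echo bin2hex(to_utf8("T\xC4\x93n\xC4\x81 koe")), "\n";         // already UTF-8
echo bin2hex(to_utf8("T\xE7n\xE2 koe", 'ISO-8859-13')), "\n";  // converted up
```

Both lines print the same UTF-8 bytes. Checking for valid UTF-8 first avoids converting text that is already UTF-8 a second time, which is the same double-conversion trap as double escaping. Be aware that a few single-byte strings happen to be valid UTF-8 by coincidence, so this check is a strong hint rather than proof.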

Protocol / standards support

HTTP

Yes. Make sure you send your character set in the Content-Type response header:

Content-Type: text/html;charset=UTF-8

HTML

Yes. Just remember to make a note somewhere in your header (or HTTP response headers) that you are using UTF-8.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Remember that this is transmitted over HTTP, so you want your HTTP headers to be correct, as some user agents will prefer the HTTP headers over the HTML, and some will prefer what the HTML says it is despite the advice of the headers.

RSS

It is XML. Remember your XML declaration and any MIME headers, and you should be onto a winner. The note regarding HTML and HTTP headers also holds true.

<?xml version="1.0" encoding="UTF-8"?>

iCalendar

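The iCalendar specification (RFC 2445) makes UTF-8 the default character set, so UTF-8 is the one to send. Line lengths are counted in octets (eight bits) rather than characters, so take care not to fold a long line in the middle of a two-byte macron. If you have to feed a client that cannot cope with UTF-8, convert down to ISO-8859-1 to be safe.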

Software support

PHP

PHP by default works in ISO-8859-1 mode, and proudly declares that in all HTTP headers. To support UTF-8, we just need to hack in some headers to declare our output to be UTF-8.

header("Content-Type: text/html; charset=UTF-8"); // must be sent before any output
echo '<?xml version="1.0" encoding="UTF-8"?>';    // only when serving XML or XHTML

Note that the following appeared to make no difference when I tried them, so don't rely on them alone:

ini_set('default_charset', 'UTF-8');
ini_set('mbstring.language', 'Neutral'); # UTF-8
ini_set('mbstring.internal_encoding', 'UTF-8');
ini_set('mbstring.http_output', 'UTF-8');
iconv_set_encoding('internal_encoding', 'UTF-8');
iconv_set_encoding('output_encoding', 'UTF-8');

You can place UTF-8 into string literals, or even escape it as hex:

echo "T\xC4\x93n\xC4\x81 koe!"; // C4 93 is ē and C4 81 is ā

The iconv() function will handle nearly all of your conversion needs. Also consider using Multibyte String Functions as replacements to your usual string functions.
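To see why the usual string functions need replacing, compare how they count. This is a small sketch; it falls back to iconv_strlen() in case the mbstring extension is not installed on your server.

```php
<?php
$greeting = "T\xC4\x93n\xC4\x81 koe";   // "Tēnā koe" as UTF-8 bytes

// strlen() counts bytes, so each macron is counted twice.
echo strlen($greeting), "\n";

// mb_strlen() counts characters when told the string is UTF-8.
// iconv_strlen() is used as a fallback here in case the mbstring
// extension is not installed.
$chars = function_exists('mb_strlen')
    ? mb_strlen($greeting, 'UTF-8')
    : iconv_strlen($greeting, 'UTF-8');
echo $chars, "\n";
```

This prints 10 and then 8: ten bytes, eight characters. The same mismatch bites substr(), strtoupper() and friends, which can cut a macron in half or leave it untouched.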

Here is a quick function to convert from UTF-8 with proper macrons to ISO-8859-1 with academic macrons:

function unmacron($t)
{
        // UTF-8 bytes of each macronised vowel => doubled ("academic") vowel.
        $doubled = array(
                "\xC4\x80" => "AA", "\xC4\x81" => "aa",
                "\xC4\x92" => "EE", "\xC4\x93" => "ee",
                "\xC4\xAA" => "II", "\xC4\xAB" => "ii",
                "\xC5\x8C" => "OO", "\xC5\x8D" => "oo",
                "\xC5\xAA" => "UU", "\xC5\xAB" => "uu",
        );
        // Double the macrons first, then let iconv transliterate whatever
        // else does not fit into ISO-8859-1.
        return iconv("UTF-8", "ISO-8859-1//TRANSLIT", strtr($t, $doubled));
}

Note here my use of //TRANSLIT on the destination character set. Without it, iconv gives up at the first character that cannot be represented in ISO-8859-1 and your string gets lopped off there. With //TRANSLIT you just end up with annoying question marks (or a near equivalent, where iconv knows one).

PHP with MySQL

MySQL 4.1 automatically thinks that Latin1 (case insensitive) is an awesome character set and should be used everywhere. As soon as you have connected to the database, tell that clod who is who:

mysql_query("SET NAMES utf8", $link);
// The mysql_* functions were removed in PHP 7; with mysqli use mysqli_set_charset($link, "utf8");

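Note that MySQL keeps separate character sets for what the client sends (character_set_client), for the connection itself (character_set_connection), and for the results it sends back (character_set_results). Very confusing! The above query sets all three to UTF-8.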

MySQL

MySQL 4.1 and later can have different character sets on a per-database, per-table, and even per-column basis. When you're designing your database, make a sensible decision early on.

latin1_swedish_ci is the default.

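utf8_general_ci is good for UTF-8. (MySQL's utf8 stores at most three bytes per character, which is plenty for macrons. From MySQL 5.5.3 there is also utf8mb4, which covers the whole of Unicode.)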

Apache 2.0

Off the shelf, the default character set is ISO-8859-1. If you wish, you can add an extension to your file names to indicate the character set. This is exactly the same way you already indicate the content type.

Instead of calling your file uc.html (indicating it has a MIME type of text/html, and using the default character set), rename it to uc.html.utf8 (indicating it has a MIME type of text/html, and has a character set of UTF-8). Note that the devil is in the details. Consider content negotiation, and remember that cool URIs don't change.

These are the default configuration settings for Apache 2.0:

AddDefaultCharset ISO-8859-1
AddCharset ISO-8859-1  .iso8859-1  .latin1
AddCharset UTF-8       .utf8

Links and Credits

Thanks to Bradley Grainger from Libronix for helping explain some of these concepts to me.

The rest I learnt through hard knocks, reading documentation, and visits to this link below:


Tīpene Cope 2005-10-11
http://www.stat.auckland.ac.nz/~kimihia/unicode-macrons