Unicode Macrons, mainly for Te Reo Māori
Tēnā koutou! Ko Tīpene tāku ingoa.
In the above sentence you should see four macronised characters. In Te Reo Māori macrons denote long vowel sounds, which can have quite a dramatic effect on meaning, for example, keke vs kēkē (cake vs armpit).
When considering your web-enabled information superhighway content delivery system you need to first slam on the brakes and consider how you want to store your content and then deliver it. Let me show you how I do so.
The Requirements
Before designing a solution, we need to know the problem. In what formats do we dish this information up? HTML? PDF? Email? RSS feed? iCalendar? vCard? XML-based format? CSV file? Excel spreadsheet? Plain text?
A very simple way of storing information is to have the server store HTML. This works well when your only outputs are HTML webpages, RSS feeds, and HTML-formatted email. If you're only going to output HTML, then simply escape all incoming data into HTML entities, store HTML in your database and in your files, and send HTML back to the user. Very simple workflow. Very easy to implement. Congratulations, you can skip the rest of this page (although you may still find it useful).
However, it may end up being better and easier to store Unicode (UTF-8) and then dish up HTML webpages as HTML and dish up additional formats with no extra conversion cost. MySQL has good character set support, so we may as well use this method.
Data flow
It pays to first consider your data flow. You should already know how people move around your website. Say they submit a form on Page A, which sends POST data to Page B, and then they get redirected to Page C to see the new item.

Behind all this there is the software on the server receiving the post data, and the database. When the browser submits the form to Page B, script P (as in POST) receives the data. It does things to the data, and then stores it in the database DB. Later, script G (as in GET) retrieves the data from the database and sends it to the browser, where it is displayed as Page C.

You need to understand this concept. Your scripts P and G receive data from the user agent and send data back to it. They also communicate with the database DB. Are they receiving nonsense? Is what they send understandable?
Each box potentially can have a different type of content in it. For example, all the pages are going to be HTML. What you put in your script is up to you. When you write to the database, you must escape any potential SQL injection in the input.
Consider this:

Page B submits UTF-8 data to script P. When script P receives the data from Page B it converts the incoming data from UTF-8 to ISO-8859-1 with HTML entities in place of macrons. Before writing the data to the database, script P converts it to SQL escaped ISO-8859-1 with HTML entities. From there the entire flow is ISO-8859-1 with HTML entities.
This is workable. However, when you need to put the data into a format that does not understand HTML entities, you run into strife.
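To make the entity-based flow concrete, here is a rough sketch of the conversion script P would perform. It assumes PHP's mbstring extension is loaded, and the function name to_entities is made up for illustration.

```php
<?php
// Hypothetical helper: turn UTF-8 text into ASCII plus decimal HTML entities,
// which fits comfortably inside ISO-8859-1.
function to_entities($utf8)
{
    // Conversion map: first code point, last code point, offset, mask.
    // Everything from U+0080 to U+FFFF becomes &#NNN;
    // (characters above U+FFFF are left alone by this map).
    $convmap = array(0x80, 0xFFFF, 0, 0xFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// The greeting is written with hex escapes so the source file's own
// encoding does not matter.
echo to_entities("T\xC4\x93n\xC4\x81 koe"), "\n"; // T&#275;n&#257; koe
```

The result is safe to push through an ISO-8859-1 pipeline, but it only makes sense to something that understands HTML entities.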
A better idea for a workflow would be to use UTF-8. With UTF-8 we are not losing any data. Remember, some of these changes can be lossy. For example, ISO-8859-1 does not allow for macrons. It is best to store your data with the highest quality, and then export at a lower quality when required.
Also consider efficiency. Conversion between formats takes time; not much time, but still time. For every time you write to a database, you will probably read the record fifteen or more times, depending on your application and traffic. (I may as well have made these figures up: I have another database that gets written to about once a year and read from daily, so a record can easily be read a thousand times for every write.) You want to do your conversions as few times as possible. This may require a little bit of research and sensible design to figure out where you can most efficiently do the conversions. You may even wish to store multiple versions of the data, rather than converting on the fly. (This is the speed / storage space tradeoff.)
I have used orange boxes to show where a conversion is taking place. They are:
- Between Page B and Script P
- Between Script P and the Database
- Between Script G and Page C (optional)
Also, do not do the same conversion in more than one place, because you can end up with double escaped characters, and those look really ridiculous. More on this later.
Cr&#257;zy!
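Here is a small sketch of how that double escaping comes about, assuming script P has already stored entity-escaped text and script G escapes it again on the way out. The variable names are only for illustration.

```php
<?php
// The text as it sits in the database, already escaped once by script P.
$stored = "Cr&#257;zy!";

// Script G escapes it a second time: now the ampersand itself gets escaped,
// so the browser shows the entity as literal text instead of a macron.
$twice = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $twice, "\n"; // Cr&amp;#257;zy!

// A stopgap: the fourth argument (double_encode = false) tells
// htmlspecialchars() to leave existing entities alone.
$stopgap = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8', false);
echo $stopgap, "\n"; // Cr&#257;zy!
```

The stopgap hides the symptom. The real fix is still to do each conversion in exactly one place.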

Alternatively you could use UTF-8 throughout your application, and avoid escaping (except to remove sneaky HTML or SQL injection attacks, which goes without saying).
The above diagram shows a flow that consists entirely of UTF-8, with only the construction of the SQL as the time when anything gets escaped.
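As a minimal sketch of the output side of an all-UTF-8 flow: the stored text stays as it is, and script G only neutralises the characters that are special to HTML. The helper name html_out is hypothetical.

```php
<?php
// Hypothetical output helper for script G. The third argument tells
// htmlspecialchars() the input is UTF-8, so multi-byte characters pass
// through untouched and only < > & " ' are escaped.
function html_out($utf8)
{
    return htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
}

$stored = "T\xC4\x93n\xC4\x81 <b>koe</b>";
echo html_out($stored), "\n"; // macrons intact, tags neutralised
```

Because nothing was turned into entities before storage, the same stored text can go straight into any other format that speaks UTF-8.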
Now, something to think about before designing your data flow is what each of these funny names means, and what is appropriate where.
An introduction to Unicode and other character sets
This is a very brief overview. You can find better resources at www.unicode.org.
I will start this introduction to Unicode by introducing you to another character set you may be more familiar with, ISO 8859-1.
Latin1 and ISO-8859-1
While not exactly the same, these two character sets are very nearly identical. These two character sets happen to be the default for a lot of software.
There are no macrons. The closest approximation is the umlaut (two dots above the character), or you could use academic macrons, that is, double the vowels. However, this can be ambiguous, because doubled vowels cannot be reliably shrunk back into macrons. For example, the whaka- prefix on a word can put two vowels side by side without making a long vowel, as in whakaata.
A character is equal to a byte.
Here is a hex dump showing on the first line what happens when you simply ignore the macrons, and on the second line what it looks like when you double the characters.
5465 6e61 206b 6f65                      Tena koe
5465 656e 6161 206b 6f65                 Teenaa koe
Here is a table showing the different macronised characters of interest, their Unicode code point, and an ISO-8859-1 approximation:
| Character | Unicode | ISO-8859-1 approximation |
|---|---|---|
| CAPITAL LETTER A WITH MACRON | U+0100 | AA |
| SMALL LETTER A WITH MACRON | U+0101 | aa |
| CAPITAL LETTER E WITH MACRON | U+0112 | EE |
| SMALL LETTER E WITH MACRON | U+0113 | ee |
| CAPITAL LETTER I WITH MACRON | U+012A | II |
| SMALL LETTER I WITH MACRON | U+012B | ii |
| CAPITAL LETTER O WITH MACRON | U+014C | OO |
| SMALL LETTER O WITH MACRON | U+014D | oo |
| CAPITAL LETTER U WITH MACRON | U+016A | UU |
| SMALL LETTER U WITH MACRON | U+016B | uu |
Table of HTML entities
One easy way to add macrons is to use the HTML entities for them. The general format is:
&#DDD;
Where DDD is the decimal value of the character. Unicode code points (U+HHHH) are written in hex, so convert them to decimal first, or use the hexadecimal form &#xHHHH; instead.
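The hex-to-decimal step is easy to get wrong by hand, so here is a small sketch of it. The function name entity_for is made up for this example.

```php
<?php
// Hypothetical helper: build the decimal HTML entity for a code point.
function entity_for($codepoint)
{
    return sprintf('&#%d;', $codepoint);
}

// U+0113 is written in hex, so pass it as a hex literal...
echo entity_for(0x0113), "\n";         // &#275;
// ...or convert a hex string with hexdec() first.
echo entity_for(hexdec('0101')), "\n"; // &#257;
```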
Here is a table showing the different macronised characters of interest, their Unicode code point, an ISO-8859-1 approximation, and the matching HTML entity:
| Character | Unicode | ISO-8859-1 approximation | HTML entity |
|---|---|---|---|
| CAPITAL LETTER A WITH MACRON | U+0100 | AA | &#256; |
| SMALL LETTER A WITH MACRON | U+0101 | aa | &#257; |
| CAPITAL LETTER E WITH MACRON | U+0112 | EE | &#274; |
| SMALL LETTER E WITH MACRON | U+0113 | ee | &#275; |
| CAPITAL LETTER I WITH MACRON | U+012A | II | &#298; |
| SMALL LETTER I WITH MACRON | U+012B | ii | &#299; |
| CAPITAL LETTER O WITH MACRON | U+014C | OO | &#332; |
| SMALL LETTER O WITH MACRON | U+014D | oo | &#333; |
| CAPITAL LETTER U WITH MACRON | U+016A | UU | &#362; |
| SMALL LETTER U WITH MACRON | U+016B | uu | &#363; |
Here is a hex dump showing a piece of ISO-8859-1 text with HTML entities for macrons.
5426 2332 3735 3b6e 2623 3235 373b 206b  T&#275;n&#257; k
6f65                                     oe
As you can see this is a lot less efficient than just using the plain characters! Or is it?
ISO-8859-13
After a little research (actually just reading James Gasson and Pablo Saratxaga's i18n localisations) I discovered that ISO-8859-13 actually includes macrons!
This extract from the ISO-8859-13 to Unicode mapping file shows in the first column the ISO-8859-13 code in hex, and in the second column the matching Unicode values.
0xC2 0x0100 # LATIN CAPITAL LETTER A WITH MACRON
0xC7 0x0112 # LATIN CAPITAL LETTER E WITH MACRON
0xCE 0x012A # LATIN CAPITAL LETTER I WITH MACRON
0xD4 0x014C # LATIN CAPITAL LETTER O WITH MACRON
0xDB 0x016A # LATIN CAPITAL LETTER U WITH MACRON
0xE2 0x0101 # LATIN SMALL LETTER A WITH MACRON
0xE7 0x0113 # LATIN SMALL LETTER E WITH MACRON
0xEE 0x012B # LATIN SMALL LETTER I WITH MACRON
0xF4 0x014D # LATIN SMALL LETTER O WITH MACRON
0xFB 0x016B # LATIN SMALL LETTER U WITH MACRON
With ISO-8859-13 it is possible to include macrons while only using one byte per character. Note this hex dump.
54e7 6ee2 206b 6f65 0a T.n. koe.
Proper macrons within eight bits.
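If you want to reproduce those bytes yourself, here is a short sketch. It assumes the iconv library behind your PHP build knows ISO-8859-13, which the common ones do, but that is worth checking on your own server.

```php
<?php
// The greeting as UTF-8, written with hex escapes so the source file's own
// encoding does not matter.
$utf8 = "T\xC4\x93n\xC4\x81 koe";

// Convert to ISO-8859-13: each macronised vowel becomes a single byte.
$latin7 = iconv('UTF-8', 'ISO-8859-13', $utf8);

echo bin2hex($latin7), "\n"; // 54e76ee2206b6f65
echo strlen($utf8), " bytes as UTF-8, ", strlen($latin7), " bytes as ISO-8859-13\n";
```

The dump above has a trailing 0a because the original file ended with a newline. This string does not.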
UTF-16
UTF-16 is a great idea where we use two bytes to represent every character. Even those characters that only need one byte get padded out to two. For those characters that need two bytes (eg, these macronised ones) we use two bytes. For those characters that do not fit in two bytes (anything above U+FFFF), well, we're in trouble: UTF-16 has to fall back on a pair of two-byte units (a surrogate pair), so the one-size-fits-all idea breaks down. Also, there is the issue of which of the two bytes is the big end, so you need to shovel in extra bytes (a byte order mark) so the recipient can figure out which end is which.
Here is a hex dump showing our very familiar piece of text in UTF-16 with correct macrons. Note each character takes two bytes, and the bytes FF FE (hex) are at the start of the string. That is the byte order mark, here announcing little-endian order.
fffe 5400 1301 6e00 0101 2000 6b00 6f00  ..T...n... .k.o.
6500                                     e.
Nice to see macrons are supported, but that looks like a damn mess!
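Here is a sketch that produces the same mess on purpose. It asks iconv for little-endian output explicitly and adds the byte order mark by hand, because what a bare "UTF-16" target gives you can differ between iconv implementations.

```php
<?php
$utf8 = "T\xC4\x93n\xC4\x81 koe";

// U+FEFF, the byte order mark, is the bytes FF FE in little-endian order.
$bom = "\xFF\xFE";

// Every character, even plain ASCII, comes out as two bytes.
$utf16 = $bom . iconv('UTF-8', 'UTF-16LE', $utf8);

echo bin2hex($utf16), "\n";
// fffe540013016e00010120006b006f006500
```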
You may have seen stuff like this if you've opened a Microsoft Word document in a plain text editor. There are null bytes all over the place. Microsoft Word (in some versions, between 1998 and whenever) stored the document with a mixture of UTF-16 big endian, UTF-16 little endian, and Windows-1252.
UTF-7
This is some sort of highfalutin way of jamming Unicode through email messages, which typically only like seven bits per byte.
Here is a hex dump showing our lovely greeting in UTF-7 with correct macrons.
542b 4152 4d2d 6e2b 4151 4520 6b6f 65 T+ARM-n+AQE koe
UTF-8
Now to get to the most interesting one!
UTF-8 looks like ISO-8859-1 (for plain ASCII text, anyway), until you put a macron in your text. Then, it very obligingly expands to gobble as many bytes as necessary to fit in that character.
The bytes C4 (hex) and C5 (hex) are big hints that a macron is heading your way. Macrons manage to fit into just two bytes, and the first byte is either C4 or C5.
Here's the table, now expanded and showing the two bytes that a macron takes up when encoded in UTF-8.
| Character | Unicode | UTF-8 (hex) |
|---|---|---|
| CAPITAL LETTER A WITH MACRON | U+0100 | C4 80 |
| SMALL LETTER A WITH MACRON | U+0101 | C4 81 |
| CAPITAL LETTER E WITH MACRON | U+0112 | C4 92 |
| SMALL LETTER E WITH MACRON | U+0113 | C4 93 |
| CAPITAL LETTER I WITH MACRON | U+012A | C4 AA |
| SMALL LETTER I WITH MACRON | U+012B | C4 AB |
| CAPITAL LETTER O WITH MACRON | U+014C | C5 8C |
| SMALL LETTER O WITH MACRON | U+014D | C5 8D |
| CAPITAL LETTER U WITH MACRON | U+016A | C5 AA |
| SMALL LETTER U WITH MACRON | U+016B | C5 AB |
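You can work the values in this table out yourself. The sketch below encodes a code point in the two-byte range (U+0080 to U+07FF) by hand, and the function name utf8_two_bytes is made up for the example. Code points U+0100 to U+013F get a lead byte of C4, and U+0140 to U+017F get C5, which is why those two bytes keep turning up.

```php
<?php
// Hypothetical helper: encode a code point from U+0080..U+07FF as two
// UTF-8 bytes. All ten macronised vowels fall in this range.
function utf8_two_bytes($codepoint)
{
    $lead  = 0xC0 | ($codepoint >> 6);   // 110xxxxx carries the top five bits
    $trail = 0x80 | ($codepoint & 0x3F); // 10xxxxxx carries the low six bits
    return chr($lead) . chr($trail);
}

echo bin2hex(utf8_two_bytes(0x0113)), "\n"; // c493
echo bin2hex(utf8_two_bytes(0x014D)), "\n"; // c58d
```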
Shall we have a look at the hex dump?
54c4 936e c481 206b 6f65 T..n.. koe
Notice two bytes for macronised characters and one byte for normal characters? I have attempted to point out the C4 93 and C4 81 in the hex dump.
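Since every macronised vowel is one of ten known byte pairs, spotting them in a UTF-8 string is straightforward. Here is a sketch, with has_macron as a made-up name. The pattern deliberately works on raw bytes (no /u modifier).

```php
<?php
// Hypothetical helper: does this UTF-8 string contain a macronised vowel?
// C4 is followed by 80, 81, 92, 93, AA or AB (A, a, E, e, I, i);
// C5 is followed by 8C, 8D, AA or AB (O, o, U, u).
function has_macron($utf8)
{
    $pattern = '/\xC4[\x80\x81\x92\x93\xAA\xAB]|\xC5[\x8C\x8D\xAA\xAB]/';
    return preg_match($pattern, $utf8) === 1;
}

var_dump(has_macron("T\xC4\x93n\xC4\x81 koe")); // bool(true)
var_dump(has_macron("Tena koe"));               // bool(false)
```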
Comparisons of all of the above representations
ISO-8859-1 / Latin1 (with no macrons)
5465 6e61 206b 6f65                      Tena koe
ISO-8859-1 / Latin1 (with academic macrons)
5465 656e 6161 206b 6f65                 Teenaa koe
ISO-8859-1 / Latin1 (with HTML entities)
5426 2332 3735 3b6e 2623 3235 373b 206b  T&#275;n&#257; k
6f65                                     oe
ISO-8859-13
54e7 6ee2 206b 6f65 0a                   T.n. koe.
UTF-16
fffe 5400 1301 6e00 0101 2000 6b00 6f00  ..T...n... .k.o.
6500                                     e.
UTF-7
542b 4152 4d2d 6e2b 4151 4520 6b6f 65    T+ARM-n+AQE koe
UTF-8
54c4 936e c481 206b 6f65                 T..n.. koe
iconv: character set conversion
Now we have had a look at some character sets, we need to consider what we will do with our text:
- Pretend everyone is using Latin1/ISO-8859-1, don't support macrons, and sometimes end up with a mess
- Remove anything outside the usual Latin1/ISO-8859-1 character set, and use HTML entities and academic macrons
- Code up UTF-8 support, support macrons, and convert our text less frequently
If you're going to be throwing text around from one character set to another, then the way to go is libiconv, which is very good for this sort of thing.
You can hack up your own support, for example, seeking out and mangling any characters you don't recognise, but th?t co?ld c??se pro?lems.
You could even try a hybrid approach, mangling some by hand, and then sending the rest off to iconv for dealing with.
A good idea would be to simply have UTF-8 everywhere, and all incoming data gets converted up to UTF-8.
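A minimal sketch of that "convert everything up to UTF-8" step might look like the following. The function name to_utf8 and its $assumed parameter are inventions for this example, and it leans on the mbstring extension for the validity check. The bytes alone cannot tell you which legacy character set a sender used, so the fallback is always a guess you have to supply.

```php
<?php
// Hypothetical input filter for script P: hand back UTF-8 no matter what
// came in. $assumed is the character set we guess a non-UTF-8 sender used.
function to_utf8($input, $assumed = 'ISO-8859-1')
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it alone
    }
    return iconv($assumed, 'UTF-8', $input);
}

// ISO-8859-13 bytes in, UTF-8 bytes out.
echo bin2hex(to_utf8("T\xE7n\xE2 koe", 'ISO-8859-13')), "\n"; // 54c4936ec481206b6f65
// Valid UTF-8 (plain ASCII included) passes straight through.
echo to_utf8("Tena koe"), "\n";
```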
Protocol / standards support
HTTP
Yes. Make sure you send your character set in the Content-Type response header:
Content-Type: text/html;charset=UTF-8
HTML
Yes. Just remember to make a note somewhere in your header (or HTTP response headers) that you are using UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Remember that this is transmitted over HTTP, so you want your HTTP headers to be correct, as some user agents will prefer the HTTP headers over the HTML, and some will prefer what the HTML says it is despite the advice of the headers.
RSS
It is XML. Remember your XML declaration and any MIME headers, and you should be onto a winner. The note regarding HTML and HTTP headers also holds true.
<?xml version="1.0" encoding="UTF-8"?>
iCalendar
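The iCalendar specification (RFC 5545) makes UTF-8 the default character set, so macrons are allowed. Line lengths are counted in octets rather than characters, though, and older clients may assume an octet (eight bits) equals a character. If you have to feed one of those, convert down to ISO-8859-1 to be safe.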
Software support
PHP
PHP by default works in ISO-8859-1 mode, and proudly declares that in all HTTP headers. To support UTF-8, we just need to hack in some headers to declare our output to be UTF-8.
header("Content-Type: text/html; charset=UTF-8");
printf('<?xml version="1.0" encoding="UTF-8"?>');
Note that the following appear to be completely useless so don't bother wasting time with them:
ini_set('default_charset', 'UTF-8');
ini_set('mbstring.language', 'Neutral'); # UTF-8
ini_set('mbstring.internal_encoding', 'UTF-8');
ini_set('mbstring.http_output', 'UTF-8');
iconv_set_encoding('internal_encoding', 'UTF-8');
iconv_set_encoding('output_encoding', 'UTF-8');
You can place UTF-8 into string literals, or even escape it as hex:
printf("T\xC4\x93n\xC4\x81 koe!");
The iconv() function will handle nearly all of your conversion needs. Also consider using Multibyte String Functions as replacements to your usual string functions.
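To show why the multibyte functions matter, here is a short comparison, assuming the mbstring extension is loaded. The usual string functions count and cut bytes, whereas the mb_ versions work in characters.

```php
<?php
$greeting = "t\xC4\x93n\xC4\x81 koe"; // "tena koe" with macrons, in UTF-8

echo strlen($greeting), "\n";             // 10, counts bytes
echo mb_strlen($greeting, 'UTF-8'), "\n"; // 8, counts characters

// substr() can slice a character in half; mb_substr() cannot.
echo bin2hex(substr($greeting, 0, 2)), "\n";             // 74c4 (broken)
echo bin2hex(mb_substr($greeting, 0, 2, 'UTF-8')), "\n"; // 74c493

// Case conversion understands the macronised vowels too.
echo bin2hex(mb_strtoupper($greeting, 'UTF-8')), "\n";   // 54c4924ec480204b4f45
```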
Here is a quick function to convert from UTF-8 with proper macrons to ISO-8859-1 with academic macrons:
function unmacron($t)
{
    // Doubled ("academic") spelling for each UTF-8 macronised vowel.
    $academic = array(
        "\xC4\x80" => "AA", "\xC4\x81" => "aa",
        "\xC4\x92" => "EE", "\xC4\x93" => "ee",
        "\xC4\xAA" => "II", "\xC4\xAB" => "ii",
        "\xC5\x8C" => "OO", "\xC5\x8D" => "oo",
        "\xC5\xAA" => "UU", "\xC5\xAB" => "uu",
    );
    // Swap the macrons first, then let iconv transliterate whatever else
    // does not fit into ISO-8859-1.
    return iconv("UTF-8", "ISO-8859-1//TRANSLIT", strtr($t, $academic));
}
Note here my use of //TRANSLIT on the destination character set. By default you will get your string lopped off at the first character that cannot be represented in ISO-8859-1. This way you just end up with annoying question marks.
PHP with MySQL
MySQL 4.1 automatically thinks that Latin1 (case insensitive) is an awesome character set and should be used everywhere. As soon as you have connected to the database, tell that clod who is who:
mysql_query("SET NAMES utf8", $link);
Note you can have different character sets for reading from and writing to the database. Very confusing! The above query sets all of them to UTF-8. (The mysql_* functions were removed in PHP 7; with newer PHP, mysqli_set_charset() does the same job.)
MySQL
MySQL 4.1 and later can have different character sets on a per-database, per-table, and even per-column basis. When you're designing your database, make a sensible decision early on.
latin1_swedish_ci is the default.
utf8_general_ci is good for UTF-8.
Apache 2.0
Off the shelf, the default character set is ISO-8859-1. If you wish, you can add an extension to your file names to indicate the character set. This is exactly the same way you already indicate the content type.
Instead of calling your file uc.html (indicating it has a MIME type of text/html, and using the default character set), rename it to uc.html.utf8 (indicating it has a MIME type of text/html, and a character set of UTF-8). Note that the devil is in the details. Consider content negotiation, and remember that cool URIs don't change.
These are the default configuration settings for Apache 2.0:
AddDefaultCharset ISO-8859-1
AddCharset ISO-8859-1 .iso8859-1 .latin1
AddCharset UTF-8 .utf8
Links and Credits
Thanks to Bradley Grainger from Libronix for helping explain some of these concepts to me.
The rest I learnt through hard knocks, reading documentation, and visits to this link below: