8 Nov 2008

Fixing broken characters in ISO Latin-1 databases

The Events Calendar on our website dates back to 2003, and the database behind it uses the ISO/IEC 8859-1 (ISO Latin-1) encoding that was current back then. We see more and more departments entering event information with full diacritics (such as "César Chávez") as support in operating systems and keyboard layouts gets better and easier to use. The calendar pages themselves declare the same ISO Latin-1 character encoding, and since the only thing on those pages is the calendar itself, it's always worked just fine.

But we recently transitioned the front page to UTF-8 to simplify editing text there. This led to an unfortunate case of mojibake when events with non-ASCII Latin-1 characters (so-called "upper ASCII") in their titles hit our UTF-8 home page, as they do starting 7 days before their occurrence.
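To make the failure concrete, here is a minimal sketch of why a Latin-1 string breaks on a UTF-8 page. The variable names and the sample string are made up for illustration. The validity check uses the common `preg_match('//u', ...)` idiom, which returns false when the subject is not valid UTF-8.

```php
<?php
// "César" as stored in an ISO-8859-1 database: é is the single byte 0xE9.
$latin1 = "C\xE9sar";

// The same name in UTF-8: é is the two-byte sequence 0xC3 0xA9.
$utf8 = "C\xC3\xA9sar";

// In UTF-8, 0xE9 announces a three-byte sequence, but the next byte is a
// plain "s", so the Latin-1 string is not valid UTF-8.
$latin1IsValidUtf8 = (preg_match('//u', $latin1) === 1);
$utf8IsValidUtf8   = (preg_match('//u', $utf8) === 1);

var_dump($latin1IsValidUtf8); // bool(false)
var_dump($utf8IsValidUtf8);   // bool(true)
```

A browser that has been told the page is UTF-8 typically shows a replacement character where the stray 0xE9 byte sits, which is the breakage we saw in the event titles.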

I considered doing a mass conversion of the database to UTF-8 encoding, but the accounts on the web from people who have tried this are pretty hairy. A better solution seemed to be to keep everything (the underlying data and the calendar pages themselves) in legacy Latin-1, and to transform the strings in real time in PHP as they are pulled from the database for the home page. There turns out to be a useful function for this: utf8_encode.

$uperson = utf8_encode($person);

This brings the character encoding into alignment with the rest of the page, while not generating gratuitous encoding headaches (and possible catastrophes) with our production event database.
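For anyone repeating this on a newer PHP, here is a hedged sketch of the same conversion wrapped in a helper. The function name `latin1_to_utf8` is hypothetical and the sample bytes stand in for a value read from the database. It assumes PHP 7 or later for the type declarations. It prefers `mb_convert_encoding` when the mbstring extension is loaded, since `utf8_encode` is deprecated as of PHP 8.2, and it falls back to `utf8_encode` otherwise.

```php
<?php
// Hypothetical helper: convert one ISO-8859-1 string to UTF-8 for output.
function latin1_to_utf8(string $s): string
{
    // Same result as utf8_encode(), without the PHP 8.2 deprecation notice.
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
    }
    // Fallback for installs without the mbstring extension.
    return utf8_encode($s);
}

// Stand-in for a value fetched from the Latin-1 event database:
// é is 0xE9 and á is 0xE1 in ISO-8859-1.
$person = "C\xE9sar Ch\xE1vez";

// Convert only at the point where the string enters the UTF-8 page.
$uperson = latin1_to_utf8($person);

// Show the resulting bytes: é becomes c3a9 and á becomes c3a1.
echo bin2hex($uperson), "\n";
```

The conversion should be applied exactly once, at the boundary between the Latin-1 data and the UTF-8 page. Running it on a string that is already UTF-8 double-encodes it and produces a new round of garbled characters.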
