Use UTF-8 in your widgets and live happily ever after
One of the requirements we insist on for developing UWA widgets is to use the UTF-8 character encoding – and when we write “we insist”, we really mean it.
It’s not that we find other encodings uncool – really, ISO-8859-1 is a nice fella, and had Ken Thompson and Rob Pike not found themselves in front of a paper placemat and tweaked UTF-8 into the most robust and flexible format there is, it would probably be the recommended one in UWA today. But that’s not the point.
HTML entities
The point is: ever since the Dawn of Ages (that is, 1995 and HTML 2), we’ve all learned not to trust our browsers’ encoding implementations, and to rely as often as possible on HTML Special Entities every time we wanted a special character (you know: à, é, ç, î, ø; but there are also क, کُليل, 約, 평 and all these weird squiggly letters, globally gathered into “non-ASCII characters”) to print correctly on screen rather than, say, turn into a question mark or a white square.
Hence, a lot of French developers tend to write “&eacute;l&eacute;ment” rather than “élément”, since at the time it proved to work on a wider array of browsers. WYSIWYG tools such as Dreamweaver even have a default configuration that renders special characters into HTML entities as they are typed.
Thing is, the UWA specification clearly says “[The widget's] file MUST be XML well-formed”, and I’ll tell you the reason why. It fits in one word: parsing. See, UWA is advertised as Universal, and as such, it is not Netvibes-API-cleverly-adapted-to-other-platforms: Netvibes is actually (and obviously) one of the supported platforms. This means that UWA widgets are “transformed” (or parsed) before they are displayed within your Netvibes page, so that they work correctly. The same parsing happens for each supported platform.
Parsers. That’s the magic behind UWA: being able to compile your UWA code into code that works for other platforms – and parsing only works if the source file (your UWA widget) respects certain expected norms. In the case of UWA, that norm is XHTML, which requires well-formed XML syntax. And we all know that HTML entities, as is, are a threat to XML well-formedness: named entities such as &eacute; are not defined in plain XML, and their & has special meaning in XML, so it must itself be escaped (&amp;) in order to be displayed as a character.
In practice, that means you shouldn’t use HTML entities because of their XML-breaking potential. This is especially true in UWA preferences, where many a UWA developer got bitten by using entities rather than the special character itself, thus making it impossible for the parser to compile.
That is, this is bad and won’t parse:
<widget:preferences>
<preference name="limitvalue" type="range"
label="Nombre d'&eacute;l&eacute;ments &agrave; afficher"
defaultValue="5" step="1" min="1" max="10" />
</widget:preferences>
…this is good and leads to happier debugging:
<widget:preferences>
<preference name="limitvalue" type="range"
label="Nombre d'éléments à afficher"
defaultValue="5" step="1" min="1" max="10" />
</widget:preferences>
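If you want to see the difference for yourself, any XML parser will do. Here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree as a stand-in for the UWA parser (that substitution is an assumption: the real UWA parser is not shown here and may report errors differently). The helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised, among other things, for entities that plain XML does not define.
        return False


# Same preference, once with HTML entities and once with literal UTF-8 characters.
WITH_ENTITIES = '<preference name="limitvalue" label="Nombre d\'&eacute;l&eacute;ments &agrave; afficher" />'
WITH_UTF8 = '<preference name="limitvalue" label="Nombre d\'éléments à afficher" />'

if __name__ == "__main__":
    print("entities:", is_well_formed(WITH_ENTITIES))
    print("utf-8:   ", is_well_formed(WITH_UTF8))
```

The entity version is rejected because &eacute; and &agrave; are HTML names that a plain XML parser knows nothing about, whereas the literal characters are just text to it.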
Therefore, the obvious solution: take advantage of the UTF-8 encoding and use special characters as-is: accents and diacritics, specific punctuation, and non-ASCII characters in general, obviously including all non-Latin writing systems.
Basic idea: you have UTF-8, so feel free to write signs the way you would normally do, not using browser tricks such as entities.
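If you have an existing widget full of entities, the cleanup can be scripted. The following is only a sketch, assuming Python and its standard html.entities table; the function name entities_to_utf8 is invented for the example. It deliberately leaves the XML-defined entities (&amp;, &lt;, &gt;, &quot;, &apos;) alone, since turning those into raw characters would break well-formedness in a different way.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entities such as &eacute; (numeric references are left untouched).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_utf8(text: str) -> str:
    """Replace HTML named entities with the characters they stand for."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return chr(name2codepoint[name])

    return _NAMED_ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(entities_to_utf8("Nombre d'&eacute;l&eacute;ments &agrave; afficher"))
```

Remember to save the result as UTF-8, otherwise you have only moved the problem.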
What the big guys say
If that’s not enough for you, we could add two facts: UWA widgets have to be made using well-formed XML syntax in order to be read by the UWA parser, and the XML recommendation explicitly says “All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1”.
While we are at the W3C, the Character Model for the World Wide Web 1.0: Fundamentals document mentions in passing that “when a unique character encoding is required, the character encoding must be UTF-8, UTF-16 or UTF-32.”
Also to be noted, the IETF’s Policy on Character Sets and Languages document explicitly says that “protocols MUST be able to use the UTF-8 charset”.
If these guys don’t make you switch to UTF-8, I don’t know what will.
But wait, there’s more!
You thought simply encoding your file using UTF-8 and typing letters instead of their entity equivalent would keep you on the safe side of the fence, didn’t you? Well, it’s only partly true.
Most widgets, especially those making use of Ajax methods, are built with the idea of loading external data and injecting it into the widget’s HTML code. You probably see where I’m going with this: because mixing encodings leads to pain and anger, external data should equally be encoded in UTF-8 – text, XML, JSON, feeds… This is a must: everything that feeds your widget should be in UTF-8.
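A quick way to check what a data source really sends is to try decoding its raw bytes as UTF-8. This is a rough sketch in Python, with a made-up helper name (is_valid_utf8). Keep in mind that a successful decode only shows the bytes are valid UTF-8; pure ASCII data passes too, whatever encoding the source claims.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The same word as two different byte sequences.
    print(is_valid_utf8("élément".encode("utf-8")))       # bytes from a UTF-8 source
    print(is_valid_utf8("élément".encode("iso-8859-1")))  # bytes from a legacy source
```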
If the source happens to be some legacy system that can’t be switched to UTF-8, you can still load it through a custom server-side script that takes charge of converting it from its original encoding into UTF-8. PHP can easily do this, as can most modern server-side languages.
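The conversion itself is short. Here is a sketch of the core step in Python rather than PHP, assuming you already know the source encoding (ISO-8859-1 in the example); the function name to_utf8 is invented, and fetching the data and serving the result are left out.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw from source_encoding into UTF-8 (hypothetical helper)."""
    # Decode with the encoding the legacy system actually uses,
    # then encode the resulting text as UTF-8.
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy_bytes = "Nombre d'éléments à afficher".encode("iso-8859-1")
    utf8_bytes = to_utf8(legacy_bytes, "iso-8859-1")
    print(utf8_bytes.decode("utf-8"))
```

The one thing the script cannot do is guess: if you declare the wrong source encoding, you get garbage out, so check with whoever runs the legacy system.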
The bottom line
There you have it. Make sure to use UTF-8, refrain from using entities, check your data’s encoding, and your widget shall be sound and safe on your users’ pages.