Character Encoding Issues and the Mobile Web

Character encoding, the binary representation underlying every symbol in documents delivered to mobile devices, is often treated as an afterthought in mobile Web development. Many developers simply rely upon ISO-8859-1. This is not a bad choice: the encoding efficiently supports all major Western European languages, has long been available on both the mobile and the fixed Internet, is widespread among low-end phones, and is the default encoding in the HTTP standard. Astute software engineers prefer UTF-8, which supports Unicode and hence the widest range of languages and associated glyphs; this makes it well suited to multilingual “world” applications. It is also a default for several application formats, the most popular encoding on the WWW, and available in newer mobile terminals. Even in Japan, i-mode gateways may take care of the complex mapping from UTF-8 to Shift_JIS for devices that only support Shift_JIS.

Producing content in one of these two major encodings and configuring a WWW server to advertise the content type and encoding properly generally suffices for mainstream applications. Complications can arise, however, especially when dealing with advanced functions or developing for exotic markets, and it is preferable to be aware of them. This article examines the situation in the context of mobile browsing.
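As a minimal sketch of what "advertising the content type and encoding" amounts to, the Python fragment below encodes a page and builds a matching HTTP header. The function name build_response and the sample markup are illustrative assumptions, and the default media type is only one plausible choice (the one commonly associated with XHTML mobile profile).

```python
def build_response(markup, charset="utf-8",
                   media_type="application/vnd.wap.xhtml+xml"):
    """Encode a page and return (headers, body) with a consistent charset label.

    Hypothetical helper: a real server framework would normally do this work.
    """
    # Raises UnicodeEncodeError if the charset cannot represent the text,
    # which is preferable to silently sending mislabelled bytes.
    body = markup.encode(charset)
    headers = {
        "Content-Type": f"{media_type}; charset={charset}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("<p>café</p>", charset="iso-8859-1")
```

The same page is one byte longer in UTF-8 than in ISO-8859-1, because "é" takes two bytes there; the charset label and the bytes actually sent must always agree.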

The Document Character Set

Let us briefly recapitulate some concepts. A character set is a repertoire of abstract symbols (e.g. lowercase “a” with acute accent, uppercase alpha). Each character is mapped to a code point in a numeric space (respectively 0x00E1 and 0x0391 in the ISO-10646 space, or 225 in ISO-8859-1 and 193 in ISO-8859-7). Characters may correspond to several code points (as with Arabic letters, for which several forms must be distinguished). Finally, each code point is represented as bits and bytes depending on the character encoding scheme. Each of the 15 ISO-8859 code spaces has just one single-byte encoding. ISO-10646 has two possible encodings: UCS-2 and UCS-4, using 2 and 4 bytes respectively. Unicode, whose code space is equivalent to ISO-10646, has, among others: UTF-8 (1 to 4 bytes), UTF-32 (4 bytes, with endian orderings), UTF-16 (2 or 4 bytes, with endian orderings), GB18030 (Chinese characters, 1, 2 or 4 bytes). Shift_JIS is a multibyte character encoding, with sequences to access both Japanese code spaces JIS-X-0201 and JIS-X-0208. Many encoding schemes, including UTF-8, all ISO-8859, and to a large extent Shift_JIS, comprise US-ASCII as a compatible subset – the basis for many protocols and formats on the Internet.
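The distinction between code point and encoded bytes can be checked with a few lines of Python, shown here as an illustrative sketch. The helper name describe and the selection of encodings are assumptions made for the example; the codecs themselves ship with the standard library.

```python
def describe(char, encodings):
    """Return the code point of `char` and its byte sequence in each encoding."""
    result = {"code point": f"U+{ord(char):04X}"}
    for name in encodings:
        try:
            result[name] = char.encode(name).hex()
        except UnicodeEncodeError:
            # The character lies outside the repertoire of this encoding.
            result[name] = None
    return result


ENCODINGS = ["iso-8859-1", "iso-8859-7", "utf-8", "utf-16-be", "utf-32-be"]

a_acute = describe("\u00e1", ENCODINGS)   # lowercase a with acute accent
alpha = describe("\u0391", ENCODINGS)     # uppercase Greek alpha
```

Here a_acute["iso-8859-1"] is "e1" (225) and alpha["iso-8859-7"] is "c1" (193), matching the values quoted above, while each character is simply unavailable in the other ISO-8859 code space.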

Web applications, as embodied in markup documents, operate within a specific character set. In practice, one must take care of this in two situations:

When using numeric character references (such as &#225; for acute-accented a, or \E1 in WCSS). The numeric value must then identify a legal code point of the document character set.
When embedding non-standard characters, called pictograms, in the document. These characters must correspond to available code points in a part reserved for user-defined symbols of the document character set.
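For a document character set of ISO-10646, the notations for numeric references can be generated and resolved as in the following sketch. The three helper names are hypothetical; html.unescape is the standard-library function that resolves references against Unicode code points.

```python
import html


def to_decimal_reference(char):
    """Markup notation with a decimal code point, e.g. &#225;"""
    return f"&#{ord(char)};"


def to_hex_reference(char):
    """Markup notation with a hexadecimal code point, e.g. &#xE1;"""
    return f"&#x{ord(char):X};"


def to_css_escape(char):
    """Style-sheet notation: backslash plus up to six hexadecimal digits.

    The trailing space terminates the escape so that a following
    hexadecimal digit is not swallowed by it.
    """
    return f"\\{ord(char):X} "


decoded = html.unescape("&#225; &#xE1;")
```

Note that html.unescape always interprets the number as a Unicode code point; it would not reproduce environments in which references point into another code space.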

Since its origins, HTML specifies its document character set to be ISO-10646, with some US-ASCII control characters left unused. The standard further defines a number of special entities (&amp; &gt; &lt; &quot;) and character entities (such as &aacute; and &Alpha;) that must be rendered by user-agents; this set corresponds to ISO-8859-1 in HTML 3.2, and has been extended in version 4.0 of the standard. The XML specification stipulates the document character set to be ISO-10646 too, which therefore applies to relevant XML dialects (WML, XHTML basic and XHTML mobile profile). XML just defines the same special character entities as in HTML, with the supplementary &apos;. The WAP standard provides special-purpose markup to insert pictograms into pages written in WML and XHTML mobile profile, so that in these cases one need not mess with non-standard symbols in reserved code points. ISO-10646 is also the document character set of CSS and WCSS; symbols can be designated by escape sequences of the form \NNNNNN standing for their hexadecimal code point.

One must distinguish between the document character set and the document encoding: it is possible to format an HTML or XML document with a non-Unicode, non-ISO-10646 encoding scheme (say ISO-8859-7), as long as the characters which fall outside the code space of the document encoding scheme (in this case, ASCII and Greek letters) are represented via appropriate Unicode-compliant numeric entities. The W3C standard does not define a default encoding for HTML documents; the default encoding for XML documents is UTF-8 or UTF-16.
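The separation between document character set and document encoding can be illustrated with Python's built-in "xmlcharrefreplace" error handler, which substitutes a Unicode numeric reference for every character that the chosen encoding cannot represent. This is only a sketch; the function name encode_for_document and the sample text are assumptions.

```python
import html


def encode_for_document(text, encoding):
    """Encode text, falling back to numeric references where needed."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# Greek alpha exists in ISO-8859-7; the accented Latin letter does not.
page = encode_for_document("\u03b1 = \u00e9", "iso-8859-7")

# A receiver decodes the bytes first, then resolves the references.
restored = html.unescape(page.decode("iso-8859-7"))
```

The alpha is emitted as the single byte 0xE1, whereas the accented letter travels as the ASCII sequence &#233; and is restored only when the references are resolved against ISO-10646.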

Local standards may depart markedly from the norms set by the WWW Consortium. In particular, the major Japanese mobile operators (DoCoMo, Softbank, KDDI and Willcom) have been developing and documenting their mobile Web environments for a long time. There, the document character set is often conflated with the document encoding, and pictograms (even equivalent ones) are placed at different positions in private areas of the relevant code spaces.

In the case of i-mode, this means that numeric character references apply to code points in the Shift_JIS space if the document is encoded with Shift_JIS, and in the Unicode space if the document is encoded with UTF-8. Pictograms are represented directly as Shift_JIS bytes, by decimal numeric references pointing in the Shift_JIS reserved code space (0xF89F-0xF95E and 0xF9B1-0xF9FC), or by hexadecimal numeric references pointing in the Unicode code space (0xE63E-0xE6BA and 0xE70C-0xE757). In Europe, supported encodings are usually Windows-1252 or ISO-8859-1, and pictograms are represented via decimal numeric references in the ranges 0xE63E-0xE6A5 and 0xE6CE-0xE757 in the Unicode space.
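The i-mode rule for Japan described above can be sketched as follows. The ranges are copied from the paragraph; the function names are hypothetical, and the range test is deliberately coarse (it does not check that a Shift_JIS value is a well-formed two-byte code), so the operator documentation remains the reference.

```python
# Reserved pictogram areas as quoted in the text (inclusive bounds).
IMODE_UNICODE_RANGES = [(0xE63E, 0xE6BA), (0xE70C, 0xE757)]
IMODE_SHIFT_JIS_RANGES = [(0xF89F, 0xF95E), (0xF9B1, 0xF9FC)]


def in_ranges(value, ranges):
    return any(low <= value <= high for low, high in ranges)


def pictogram_reference(code, document_encoding):
    """Return a numeric reference for a pictogram, depending on the encoding.

    Assumption: hexadecimal references for UTF-8 documents and decimal
    references for Shift_JIS documents, as described in the text.
    """
    encoding = document_encoding.lower()
    if encoding == "utf-8":
        if not in_ranges(code, IMODE_UNICODE_RANGES):
            raise ValueError("not in the Unicode pictogram area")
        return f"&#x{code:X};"
    if encoding == "shift_jis":
        if not in_ranges(code, IMODE_SHIFT_JIS_RANGES):
            raise ValueError("not in the Shift_JIS pictogram area")
        return f"&#{code};"
    raise ValueError("unsupported document encoding")


utf8_ref = pictogram_reference(0xE63E, "UTF-8")
sjis_ref = pictogram_reference(0xF89F, "Shift_JIS")
```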

Willcom follows a scheme similar to i-mode, except that the code space reserved for pictograms is different (0xF040 to 0xF14D in Shift_JIS).

KDDI deploys Openwave browsers; hence, pictograms are embedded in Web pages via encoding-independent markup. With other applications (e.g. e-mail), pictograms are inserted directly as special byte sequences. Shift_JIS is the preferred document encoding.

Softbank phones support the EUC-JP, ISO-2022-JP and Shift_JIS encodings; newer devices (since the series “W” and “3G”) also support UTF-8, and their browsers handle numeric character references. Pictograms are entered as special byte sequences.

Clearly, a developer must first ascertain the exact document character set manipulated in the target environment, how it differs from standards, its relation to the document encoding, and the mapping of proprietary symbols. This applies to further applications as well: the Java programming language, for instance, specifies the application character set to be Unicode, provides a notation for numeric character entities, and lists the encodings to be supported by a compliant terminal. However, many small-footprint versions of the Java run-time have more limited capabilities: for instance many Java-capable Motorola handsets only handle UCS-2 and ISO-8859-1 encodings.

The Document Character Encoding

Since one can represent every symbol in a document character set by a numeric entity, would it not be straightforward to encode every HTML, WML or XHTML page entirely in US-ASCII, with all non-ASCII characters appearing as numeric references? This approach is technically feasible, but exhibits several shortcomings:

This kind of formatting reduces the legibility of documents. Furthermore, editors and authoring tools might not have built-in support for manipulating numeric references, forcing one to type &#225; explicitly instead of simply á. All the more so, since the notation for numeric references in style sheets differs from the one in the enclosing markup document – it is \E1 for á, in external style sheets, in those embedded in






Content                Form attribute "accept-charset" defined
Markup format          No          Yes
HTML                   3.2         4.0
XHTML basic            1.0         1.1
XHTML mobile profile   1.0, 1.1    1.2
WML                    2.0 ( )     1.1, 1.2, 1.3, 2.0 ()


The special versions of HTML and XHTML designed by Japanese operators seldom implement "accept-charset" in forms, since content is supposed to be in Shift_JIS anyway. 
The fact that a browser decodes input in a certain range of encodings does not imply that it can produce output in the same range of encodings. In fact, the latter set is usually significantly smaller than the former one. The values in the "accept-charset" form attribute and the encoding of the Web page itself must take this further constraint into account. Prime candidate encodings derived from the HTTP header (preferably those with a q-value of 1), the user agent profile and manufacturer's technical manuals are usually universal encodings such as UTF-8 and UCS-2, and dominant local schemes (ISO-8859-1, Shift_JIS, etc).
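Deriving candidate encodings from the HTTP Accept-Charset header can be sketched as below. This is a simplified parser written for illustration: the function names are assumptions, the wildcard "*" is treated as an ordinary name, and a production system would combine the result with the user agent profile and the manufacturer's documentation, as noted above.

```python
def parse_accept_charset(header):
    """Return (charset, q) pairs sorted by decreasing q-value."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unreadable q-value: treat as not acceptable
        preferences.append((name.strip().lower(), q))
    # sorted() is stable, so header order is kept among equal q-values.
    return sorted(preferences, key=lambda pair: -pair[1])


def pick_charset(header, server_supported):
    """Pick the most preferred charset that the server can also produce."""
    for name, q in parse_accept_charset(header):
        if q > 0 and name in server_supported:
            return name
    return None


header = "ISO-8859-1, utf-8;q=0.8, *;q=0.1"
choice = pick_charset(header, ["utf-8", "shift_jis"])
```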
Interactions with other user agents, for instance via the mmsto: and mailto: URI schemes, raise similar difficulties: MMS readers and e-mail clients have their own restrictions regarding the allowable input and output character encodings, and these might not be exactly the same as for the browser. The situation gets thornier when accessing the wireless telephony application interface: which characters can be stored in the phonebook? Which symbols can be included in an SMS? Unfortunately, these functions, except for MMS, are not subject to a normalized description in the user agent profile, and (except perhaps in Japan) may not be explained in enough detail in the readily available developers' documentation. Looking at non-Internet norms helps: if the smsto: URI scheme seems to behave haphazardly, it is worth checking whether it is actually implementing the default 7-bit alphabet of GSM 03.38.
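Whether a text fits the GSM 03.38 default alphabet can be tested with a sketch like the following. The two character tables are transcribed here as an assumption and should be verified against the specification before being relied upon; the function name gsm7_septets is hypothetical.

```python
# Basic table of the GSM 03.38 default alphabet (128 positions, including
# the escape code 0x1B), transcribed for illustration only.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Characters reached through the escape code; each costs two septets.
GSM_EXTENSION = "\f^{}\\[~]|€"


def gsm7_septets(text):
    """Return the number of septets needed, or None if the text does not fit."""
    total = 0
    for char in text:
        if char in GSM_BASIC and char != "\x1b":
            total += 1
        elif char in GSM_EXTENSION:
            total += 2
        else:
            return None  # would require another SMS encoding
    return total
```

With these tables, "à" fits in one septet whereas "á" is not available at all, which is the kind of asymmetry that makes a message appear to be handled haphazardly.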
Internationalization in service platforms is an issue whose comprehensive exposition is beyond the scope of a short paper. In short, tools that are not natively designed around Unicode hinder the development of internationalized applications. We pinpoint three essential facets:

The character set used internally. Modern software systems rely upon Unicode (frequently encoded as UCS-2 or UTF-16), and are thus able to process, compare and sort strings with few restrictions. Other environments carry a legacy of having been originally built for one-byte character sets (i.e. US-ASCII, ISO-8859-1); multi-byte string manipulation routines may exist, but often do not implement all required functions, exhibit inconsistent capabilities with respect to character encodings, or lack internationalization support for crucial services entirely (e.g. sorting). PHP, for instance, is still affected by these shortcomings.

The encodings used for information stored persistently. It may be possible to encode and save data (respectively, decode and read it in) in a number of character encodings. Thus, MySQL allows database administrators to specify the character encoding of text attributes at the level of databases, tables, and individual columns, augmented with a language-specific collating sequence (i.e. the sorting order stating that "ä" is sorted as "ae" in German, but comes after "z" and "å" in Finnish). The DBMS sorts data according to the collating sequence when a query is executed. Conversely, prudent programmers keep their PHP source code in ASCII - or at least in a single-byte encoding format.

The encoding parameters when communicating with client applications. Generally, it is possible to set up the encoding for outbound data, and the encoding which is assumed for inbound data (e.g. via SET CHARACTER SET in MySQL; via functions mb_http_input, mb_output_handler, and mbstring run-time configuration variables in PHP). Sometimes, as in PHP, an auto-detection scheme is relied upon when several possible input encodings are expected. Data is automatically converted from the internal character set to the output encoding, and from the input encoding to the internal character set. 
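The three facets can be sketched together in Python, whose strings are Unicode internally. Everything here is a simplified assumption for illustration: the function names are hypothetical, and the two sort keys only mimic the German and Finnish rules mentioned above, whereas a real platform would rely on the collations of its DBMS or on a dedicated internationalization library.

```python
def decode_inbound(data, input_encoding):
    """Convert inbound bytes to the internal (Unicode) representation."""
    return data.decode(input_encoding)


def encode_outbound(text, output_encoding):
    """Convert internal text to the output encoding, using numeric
    references for characters that the encoding cannot represent."""
    return text.encode(output_encoding, errors="xmlcharrefreplace")


# Simplified collation rules, for illustration only.
GERMAN_EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
FINNISH_ORDER = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzåäö")}


def german_key(word):
    return word.lower().translate(GERMAN_EXPANSIONS)


def finnish_key(word):
    # Unknown characters sort after the whole alphabet.
    return [FINNISH_ORDER.get(ch, len(FINNISH_ORDER)) for ch in word.lower()]


inbound = "zeta,äiti,apu,åbo".encode("iso-8859-1")
words = decode_inbound(inbound, "iso-8859-1").split(",")

finnish = sorted(words, key=finnish_key)
german = sorted(words[:3], key=german_key)
outbound = encode_outbound(",".join(finnish), "utf-8")
```

The same internal strings yield different orders under the two collations, and conversion happens only at the boundaries of the component.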

Overall, the goal is to ensure compatibility between the character encodings accepted and produced by the different units in the service delivery chain. For performance reasons, the selected encodings should ideally be the internal character sets of the various components, so that no conversion is needed.

Concluding Remarks

Ultimately, the most severe constraints regarding character encoding are imposed by market requirements: the languages spoken in a country, the scripts used to write them, the capabilities of phones released to customers, the format of source material incorporated in Web sites, etc. From this perspective, mobile developers can rarely avoid dealing with well-established regional character sets such as Big5, GB2312, GB18030, KOI8-R, TIS-620, Shift_JIS, or ISO-2022-JP. For generic or multilingual applications, the observations in the introduction apply: ISO-8859-1 is an efficient encoding for Western languages (and, because of the dissemination of English, French, Portuguese and Spanish, is applicable in a large number of countries around the world), while UTF-8 is well suited to international services. The table in the appendix, derived from 4295 user agent profiles, shows the relative importance of universal and regional character sets supported by mobile browsers. These statistics are an approximation, since they under-represent those models that do not publish any user agent profile: many WAP 1 handsets, low-end phones, and a large range of Japanese terminals.

Internet standards (www.ietf.org, www.w3.org) address many questions regarding internationalization in detail but, as already mentioned, are not undisputedly authoritative because of the prevalence of market-specific solutions, foremost in Asian countries such as Japan and Korea. There, the developers' documentation published by operators constitutes the reference. The W3C site provides a wealth of tutorials, hands-on FAQs, and reference documents about internationalization (www.w3.org/International). Other sites that delve in depth into the concepts and practical difficulties with character sets can be found at www.alanwood.net/unicode and www.cs.tut.fi/~jkorpela/chars/index.html. The various national and international norms (especially www.unicode.org) remain indispensable for those developers who must implement complex encoding, decoding and typesetting utilities.

APPENDIX: Frequency of Supported Charsets

Charsets are sorted according to decreasing frequencies of appearance in user agent profiles, and arranged in regional groups. Those charsets mentioned in less than 0.82% of the profiles are summed up under the category "other"; we observe the presence of a long tail of special encodings for various Asian languages (Tamil and Vietnamese, besides further charsets for Chinese, Japanese, Korean and Thai). A few profiles, marked "none", do not declare any supported charset. ASCII is classified as the lowest common denominator amongst encodings.


Charset       World             Europe               Europe           Far East                  Near East
              universal, misc.  West, Centre, North  Cyrillic, Greek  Ch. Jap. Kor. Thai, etc.  Turk. Hebr. Arabic
utf-8         95.9 %
us-ascii      87.0 %
iso-8859-1                      84.9 %
ucs-2         65.4 %
utf-16        26.8 %
koi8-r                                               10.4 %
iso-8859-2                      9.0 %
iso-8859-7                                           8.9 %
iso-8859-5                                           8.3 %
big5                                                                  8.2 %
iso-8859-9                                                                                      8.2 %
iso-8859-4                      7.9 %
windows-1252                    7.5 %
windows-1250                    7.0 %
shift_jis                                                             6.7 %
windows-1253                                         6.6 %
windows-1254                                                                                    6.6 %
euc-jp                                                                5.6 %
iso-2022-cn                                                           5.3 %
iso-2022-jp                                                           5.3 %
gb2312                                                                5.2 %
iso-8859-3                                                                                      5.2 %
iso-8859-6                                                                                      4.9 %
iso-8859-8                                                                                      4.9 %
iso-8859-10                     4.3 %
iso-8859-15                     3.9 %
iso-8859-8-i                                                                                    3.8 %
windows-1256                                                                                    3.7 %
windows-1257                    3.7 %
windows-1251                                         3.7 %
windows-1255                                                                                    3.7 %
cp936                                                                 3.6 %
euc-kr                                                                3.5 %
gb18030                                                               3.4 %
ks-c-5601                                                             3.3 %
tis-620                                                               3.3 %
utf-7         3.3 %
ucs-4         3.2 %
iso-8859-13                     1.2 %
iso-8859-14                     1.2 %
other         0.4 %             0.7 %                0.6 %            2.8 %                     0.3 %
none          1.7 %



