http://codesnipers.com/?q=node/68
With the explosion of international text resources brought by the Internet, the conventions for determining a file's encoding have become more important. This is my attempt to make text file encoding issues digestible by leaving out some of the unimportant anecdotal details. I'm also calling attention to blunders in the MSDN docs.
For Unicode files, the BOM ("Byte Order Mark", also called the signature or preamble) is a short sequence of bytes (two or three for the encodings listed below) at the beginning of the file that indicates the type of Unicode encoding. The key point is that the BOM is generally not treated as part of the file's text when it is loaded into memory, but it may determine how the file is loaded. Here are the most important BOMs and the encodings they indicate:
- FF FE UCS-2LE or UTF-16LE
- FE FF UCS-2BE or UTF-16BE
- EF BB BF UTF-8
BOM stands for Byte Order Mark, which literally is meant to distinguish little-endian (LE) from big-endian (BE) byte order by representing the code point U+FEFF ("zero width no-break space"). Since the byte-swapped value 0xFFFE is not a valid Unicode character, you can be sure of the byte order. Note that the BOM does not distinguish between UCS-2 and UTF-16 (they are the same except that UTF-16 adds surrogate pairs to represent more code points). The UTF-8 BOM has nothing to do with "byte order", but it is a distinctive enough sequence (it encodes the same U+FEFF code point) that it is very unlikely to appear at the start of a non-UTF-8 file, which distinguishes UTF-8 especially from other "ASCII-based" encodings.
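To make the table above concrete, here is a minimal sketch in C of BOM sniffing. The function and enum names are my own, not any Windows API; it simply checks the first bytes of a buffer against the three signatures and reports how many bytes to skip.

#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF8 } encoding_t;

/* Check the first bytes of a file buffer against the BOM table above.
   Sets *bom_len to the number of signature bytes to skip (0 if no BOM). */
encoding_t sniff_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;                 /* UTF-8 signature */
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;              /* UCS-2LE / UTF-16LE */
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;              /* UCS-2BE / UTF-16BE */
    }
    return ENC_UNKNOWN;                  /* no BOM; fall back to other heuristics */
}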
The BOM is not guaranteed to be in every Unicode file. In fact, a BOM is not necessary in XML files, because the encoding can be determined from the leading less-than sign. If the first byte is an ASCII less-than sign 3C followed by a non-zero byte, the file is assumed to be UTF-8 until otherwise specified in the XML declaration. If the XML file begins with 3C 00, it is UCS-2LE or UTF-16LE (normal for Windows Unicode files). If it begins with 00 3C, it is UCS-2BE or UTF-16BE. In summary, then:
- 3C 00 UCS-2LE or UTF-16LE
- 00 3C UCS-2BE or UTF-16BE
- 3C XX UTF-8 (where XX is non-zero)
Unlike the BOM, these bytes are part of the text content of the file.
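Here is a minimal sketch of that leading-byte heuristic, reusing the encoding_t enum from the BOM sketch above (again, the function name is mine). Because these bytes are real content, nothing is skipped.

/* Apply the XML leading-byte heuristic described above. */
encoding_t sniff_xml_lead(const unsigned char *buf, size_t len)
{
    if (len < 2) return ENC_UNKNOWN;
    if (buf[0] == 0x3C && buf[1] == 0x00) return ENC_UTF16LE;  /* '<' 00 */
    if (buf[0] == 0x00 && buf[1] == 0x3C) return ENC_UTF16BE;  /* 00 '<' */
    if (buf[0] == 0x3C && buf[1] != 0x00) return ENC_UTF8;     /* '<' then non-zero */
    return ENC_UNKNOWN;
}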
Also, the BOM is not used to determine non-Unicode encodings such as Windows-1250, GB2312, Shift-JIS, or even EBCDIC. Generally you should have some way of knowing the encoding from the context of your situation. However, markup languages have ways of specifying the encoding in the markup near the top of the file. The following appears inside the <head> area of an HTML page:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
In XML, the XML declaration specifies the encoding.
<?xml version="1.0" encoding="iso-8859-1"?>
This works because all of these encodings are compatible enough with ASCII in the lower 128 that the encoding name can be read as an ASCII string. The alternative to ASCII is EBCDIC: if an XML file begins with the EBCDIC less-than sign 4C, it is treated as EBCDIC, and the XML declaration (if present) is processed in standard EBCDIC to determine the extended EBCDIC variant.
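As an illustration of reading the declaration as ASCII text (not a conforming XML parser), here is a sketch of a hypothetical helper of my own, xml_declared_encoding, that scans a NUL-terminated, ASCII-compatible header for the declared encoding name.

#include <stddef.h>
#include <string.h>

/* Scan an ASCII-compatible header for encoding="..." or encoding='...'
   and copy the name out. A real parser would confirm the match lies
   inside the <?xml ... ?> declaration. Returns NULL if not found. */
const char *xml_declared_encoding(const char *head)
{
    static char name[32];
    const char *p = strstr(head, "encoding=");
    if (p == NULL) return NULL;
    p += strlen("encoding=");
    char quote = *p;
    if (quote != '"' && quote != '\'') return NULL;
    p++;
    size_t i = 0;
    while (p[i] != '\0' && p[i] != quote && i < sizeof(name) - 1) {
        name[i] = p[i];
        i++;
    }
    name[i] = '\0';
    return name;               /* e.g. "iso-8859-1" for the example above */
}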
If I hadn't seen fundamental problems in Microsoft documentation before, it would be inconceivable to me that articles about the Unicode BOM would show the BOM backwards, but it is true! Both the "Byte-order Mark" and "IsTextUnicode" MSDN articles get the UTF-16 BOM bytes backwards. Herfried K. Wagner [MVP] agreed that the documentation is wrong (it is plain as day, but it is still good to have MVP corroboration).
Another thing that irks me is the IsTextUnicode API. This API is so useless it bears mentioning. First, it does not deal with UTF-8, so it can only be one part of how Notepad detects Unicode. Raymond Chen said it is used in Notepad, but Notepad also auto-detects UTF-8 as explained below (his blog, The Old New Thing, is one of my three favorites). Second, once text is in memory it usually should not carry the BOM, since it is only efficient to interrogate the BOM before loading the file. Third, if you cannot guarantee the results of the statistical analysis, you are better off using a simple deterministic mechanism that you can explain to the end user, so that they don't experience seemingly erratic behavior.
UTF-8 auto-detection (when there is no UTF-8 BOM) is not widely documented (although it is exhibited by Windows Notepad). UTF-8 multi-byte sequences have a pattern that would be almost impossible to accidentally produce in another encoding. The more non-ASCII characters a UTF-8 file has, the more certain you can be that it is UTF-8, but my impression is that you don't need many.
Without going into the full algorithm, just perform UTF-8 decoding on the file looking for an invalid UTF-8 sequence. The correct UTF-8 sequences look like this:
- 0xxxxxxx ASCII < 0x80
- 110xxxxx 10xxxxxx 2-byte >= 0x80
- 1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x800
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
If the file is all ASCII, you cannot know if it was meant to be UTF-8 or any other character set compatible with ASCII in the lower 128.
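Here is a minimal sketch of that check (my own code, not Notepad's actual algorithm). It validates only the lead/continuation-byte pattern above, ignoring finer points such as overlong sequences, and it reports separately whether any multi-byte sequence was seen at all, since an all-ASCII file proves nothing.

#include <stddef.h>

/* Return 1 if every sequence in buf follows the UTF-8 pattern above,
   0 at the first invalid or truncated sequence. *seen_multibyte tells
   the caller whether any non-ASCII sequence was actually present. */
int looks_like_utf8(const unsigned char *buf, size_t len, int *seen_multibyte)
{
    size_t i = 0;
    *seen_multibyte = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t trail;
        if (b < 0x80)                trail = 0;  /* 0xxxxxxx: ASCII */
        else if ((b & 0xE0) == 0xC0) trail = 1;  /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) trail = 2;  /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) trail = 3;  /* 11110xxx */
        else return 0;                           /* invalid lead byte */
        if (trail > 0) {
            if (i + trail >= len) return 0;      /* truncated at end of file */
            for (size_t k = 1; k <= trail; k++)
                if ((buf[i + k] & 0xC0) != 0x80) /* must be 10xxxxxx */
                    return 0;
            *seen_multibyte = 1;
        }
        i += trail + 1;
    }
    return 1;
}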
Without contextual information, a BOM, or a file type with a self-describing header like XML and HTML, a file should be assumed to be in the default system ANSI code page, governed by the "Language for non-Unicode Programs" setting in the Regional Settings of the computer on which it is found.