Peter Bissmire

Communications & Language Services

Technical and general translations, French/German -> English

18-08-08

The BOM bomb

After a period of proliferating text encoding standards (my browser lists 27 and I am sure there have been many more), the Web standard has now been fixed as UTF-8 (8-bit Unicode transformation format). A part of Unicode is the byte order mark (BOM, a kind of magic number) at the beginning of the file. The best way to understand this is to consider UTF-16. To construct 16-bit words from a byte-oriented file, pairs of bytes must be "glued together" in the correct order. The BOM is the character U+FEFF, so the UTF-16 BOM appears in the file as hex FE FF (big-endian) or FF FE (little-endian). If, when it starts to read the file, your system sees FFFE, which is not a valid character, then it knows it has got the byte order wrong and must swap it before continuing. The BOM is optional. If it is absent, big-endian is presumed.
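The byte order check described above can be sketched in a few lines of Python. This is only an illustration: the function name detect_utf16_order is my own, and the fallback to big-endian simply follows the presumption stated above (real software often guesses differently).

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Return the byte order implied by a leading UTF-16 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
        return "little-endian"
    return "big-endian"  # no BOM: presume big-endian, as described above

# The letter "A" (U+0041) preceded by a BOM, in both byte orders.
big = b"\xfe\xff\x00A"
little = b"\xff\xfeA\x00"

print(detect_utf16_order(big))      # big-endian
print(detect_utf16_order(little))   # little-endian

# Python's "utf-16" codec reads the BOM, picks the byte order and drops the BOM.
print(big.decode("utf-16"), little.decode("utf-16"))   # A A
```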

UTF-8 encodes each character as a sequence of 1, 2, 3 or 4 bytes. Since the bytes are always read in the same order and are not concatenated to form words, a BOM is redundant. There is, nevertheless, a BOM for UTF-8; it is hex EF BB BF. Software that does not expect to find non-ASCII characters at the start of a file will be confused by this. The legacy CP1252 character representation of the BOM is ï»¿, which you may have seen from time to time. It can be argued that the presence of the BOM identifies the content as a UTF variety (a Unicode signature). This is fine so long as, in every other character encoding scheme, starting a file with these bytes is forbidden or virtually impossible. Unfortunately, this is not sufficiently the case, so the only way of being sure is an external content encoding declaration such as an HTML meta tag.
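To see the three BOM bytes and their CP1252 appearance for yourself, a short Python sketch will do. The sample text "Hello" is arbitrary; the codec names are those of Python's standard library.

```python
import codecs

text = "Hello"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# The first three bytes are the UTF-8 BOM.
print(with_bom[:3].hex(" "))            # ef bb bf

# Read as CP1252, the BOM shows up as three visible characters.
print(with_bom.decode("cp1252"))        # ï»¿Hello

# Read as plain UTF-8, the BOM survives as an invisible U+FEFF character.
print(repr(with_bom.decode("utf-8")))   # '\ufeffHello'
```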

How the BOM bombs
It can be argued that software utilising a UTF file should note the BOM, if present, and then strip it from the file before processing the content. However, this is often not the case.
Script files start with a shebang (#!) line, another magic number, which tells the system where to find the interpreter for the script. Preceding it with a BOM means the file no longer begins with #!, which can derail the system and prevent the script from running.
Include files are inserted into other files. If they start with a BOM, this will now be mid-text. Unicode says this should be treated as a zero-width, non-breaking space. In PHP includes, the usual result (arguably incorrect) is an initial blank line, the equivalent of an HTML <br/>.
Microsoft appears to have drifted in the past from "the BOM is optional so we might as well play safe and insert it" towards "the BOM is obligatory". In particular, Notepad always saves UTF-8 files with a BOM and certain MS software will not work without it.
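The "note the BOM, then strip it" approach mentioned at the start of this section might look like the following in Python. It is a minimal sketch, not how any particular program does it: split_bom is a name I have made up, and the sample script is invented to show the shebang problem.

```python
import codecs
from typing import Tuple

def split_bom(data: bytes) -> Tuple[bool, bytes]:
    """Report whether data starts with the UTF-8 BOM and return the content without it."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data

# An invented shell script saved with a BOM in front of the shebang line.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

print(script.startswith(b"#!"))            # False: the BOM hides the shebang
had_bom, clean = split_bom(script)
print(had_bom, clean.startswith(b"#!"))    # True True
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec does the same job on decoding: it drops a leading BOM if there is one and otherwise behaves like "utf-8".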

Solutions
Notepad++ and Programmer's Notepad are examples of text editors that offer the option of saving UTF-8 without BOM. In addition, they provide programmer aids such as colour highlighting according to what you are programming, line numbering, adjustable indenting and auto-completion of, for example, brackets and HTML tag pairs.
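If a file has already been saved with a BOM and you would rather not open it in an editor, the BOM can also be removed with a few lines of script. The following Python sketch assumes the file is small enough to read into memory; remove_utf8_bom and the file name include.php are my own illustrative choices.

```python
import codecs
import os
import tempfile

def remove_utf8_bom(path: str) -> bool:
    """Rewrite the file at path without a leading UTF-8 BOM. Return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False   # nothing to do, the file is left untouched
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True

# Demonstration on a temporary file saved with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "include.php")
    with open(path, "w", encoding="utf-8-sig") as f:   # "utf-8-sig" writes a BOM
        f.write("<?php echo 'hi'; ?>")
    print(remove_utf8_bom(path))   # True
    print(remove_utf8_bom(path))   # False: the BOM has already gone
```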

Sources
The information presented here is a synthesis drawing on Wikipedia and various discussion group postings.