100% agree. > using BOMs would require all existing code to be aware of them, ev...

nybble41 · on April 15, 2020

> You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream.

Actually you can; the ability to concatenate UTF-8 streams is an intentional part of the design of UTF-8. The BOM is an ordinary Unicode code point (U+FEFF) and can occur in the middle of a valid UTF-8 stream, where it should be treated as either a zero-width non-breaking space or an unsupported character (which only affects rendering). So concatenating two UTF-8 streams with leading BOMs still results in a valid UTF-8 stream, albeit with an extra zero-width space.
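A quick way to check the concatenation claim is to do it in Python. This is only a sketch: the sample strings are made up, and it relies on the standard `codecs.BOM_UTF8` constant and a strict `utf-8` decode.

```python
import codecs

# Two independent UTF-8 streams, each starting with a BOM (bytes EF BB BF).
first = codecs.BOM_UTF8 + "hello ".encode("utf-8")
second = codecs.BOM_UTF8 + "world".encode("utf-8")

combined = first + second

# A strict decode succeeds, so the concatenation is still well-formed UTF-8.
text = combined.decode("utf-8")

# Both BOMs survive as ordinary U+FEFF code points: one leading, one mid-stream.
bom_positions = [i for i, ch in enumerate(text) if ch == "\ufeff"]
print(bom_positions)
```

The decode does not raise, and the second BOM simply shows up as a U+FEFF character in the middle of the decoded text.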

The bigger problem with the BOM is that it breaks transparent compatibility with ASCII. Absent a leading BOM character, a UTF-8 stream containing only code points 0-127 is binary-identical to an ASCII-encoded text stream and can be handled with tools that are not UTF-8 aware. This was an explicit design consideration for both Unicode and UTF-8. Add the BOM, however, and your file is no longer plain text, which can lead to syntax errors or other issues that are difficult to diagnose because the BOM is invisible in UTF-8 aware text editors.
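To make the ASCII point concrete, here is a small Python sketch. The shell-script content and the `decodes_as_ascii` helper are invented for illustration; the shebang case is just one example of a tool that looks at the first bytes of a file.

```python
import codecs

# An ASCII-only script, as a tool that knows nothing about UTF-8 would see it.
text = "#!/bin/sh\necho ok\n"

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# Without a BOM, the UTF-8 and ASCII encodings are byte-for-byte identical.
same_bytes = utf8_bytes == ascii_bytes

# Prepending the BOM adds three non-ASCII bytes (EF BB BF) at the front.
with_bom = codecs.BOM_UTF8 + utf8_bytes


def decodes_as_ascii(data: bytes) -> bool:
    """Hypothetical helper: True if every byte is in the ASCII range."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(same_bytes)                    # identical encodings without a BOM
print(decodes_as_ascii(with_bom))    # the BOM bytes are not ASCII
print(with_bom.startswith(b"#!"))    # the shebang is no longer the first two bytes
```

With the BOM in place, a consumer that checks for `#!` at offset 0 or insists on 7-bit input no longer sees what it expects, even though the file looks unchanged in an editor.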

I think the BOM was a mistake—along with the variable-length multi-byte encodings it was created to support—but unfortunately at this point we're stuck with it. (Actually the BOM is prohibited in the multi-byte formats with an explicit byte order, like UTF-16BE; it would have been really nice if the same policy had been applied to UTF-8 where byte order is irrelevant.) The best we can do is recommend that new programs omit the BOM when outputting UTF-8 and either skip it at the beginning or convert it to U+2060 WORD JOINER anywhere else when it appears in the input.
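As a rough sketch of that recommendation in Python (the function names `read_utf8` and `write_utf8` are made up here, and this is one possible reading of the policy, not a standard API):

```python
BOM = "\ufeff"          # U+FEFF, the BOM / ZERO WIDTH NO-BREAK SPACE
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def read_utf8(data: bytes) -> str:
    """Decode UTF-8 input, skipping one leading BOM and mapping any
    other U+FEFF to U+2060 WORD JOINER."""
    text = data.decode("utf-8")
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)


def write_utf8(text: str) -> bytes:
    """Encode output as UTF-8; Python's plain "utf-8" codec never adds a BOM."""
    return text.encode("utf-8")


sample = b"\xef\xbb\xbfab\xef\xbb\xbfcd"  # BOM, "ab", BOM, "cd"
print(repr(read_utf8(sample)))
print(write_utf8("ab"))
```

Python also ships a `utf-8-sig` codec that strips a leading BOM on decode, but it prepends one on encode, so the plain `utf-8` codec is the one to use for BOM-free output.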

alkonaut · on April 15, 2020

Interesting, I thought a BOM-in-the-middle was invalid. I know apps are even more likely to choke on that than a leading BOM though.

In any case, you need to handle it in every app that claims to read UTF-8. The loss of compatibility is indeed the biggest problem and I agree the BOM should be omitted when possible, but that doesn’t change the fact that it’s part of the spec and millions of UTF-8 files have a BOM.

Even if 100% of apps stopped writing a BOM today, you still couldn’t ignore it in a parser.