[Development] utf-8 BOM and parsers

Mon Apr 14 19:33:48 CEST 2014

On Monday, 14 April 2014, at 09:59:18, Thiago Macieira wrote:
> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
> started on Windows, with tools like Notepad, where changing the system
> locale is not an option.

To be clear: a BOM is there to determine that the content *is* UTF-8.
Once you know that it is UTF-8, you can strip the BOM and pass the rest to
the decoder. Passing the BOM to the decoder sounds wrong, because you would
then be expecting it to choose the codec while decoding. That's what Notepad
does: if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
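
The detect-then-strip step described above could be sketched roughly as
follows. This is plain C++ on raw bytes, not Qt code; the names Encoding,
Detected, hasUtf8Bom and detectAndStrip are hypothetical and exist only for
this illustration.

```cpp
#include <string>

// Encodings this sketch distinguishes, mirroring the Notepad behaviour above.
enum class Encoding { Utf8, LegacyAnsi };

struct Detected {
    Encoding encoding;    // codec the caller should use
    std::string payload;  // bytes to hand to that codec, leading BOM removed
};

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
bool hasUtf8Bom(const std::string &raw)
{
    return raw.size() >= 3
        && static_cast<unsigned char>(raw[0]) == 0xEF
        && static_cast<unsigned char>(raw[1]) == 0xBB
        && static_cast<unsigned char>(raw[2]) == 0xBF;
}

// Hypothetical helper: pick the codec from the BOM, then strip it exactly once.
// Any later U+FEFF stays in the payload as ordinary content.
Detected detectAndStrip(const std::string &raw)
{
    if (hasUtf8Bom(raw))
        return { Encoding::Utf8, raw.substr(3) };
    return { Encoding::LegacyAnsi, raw };
}
```

The decoder then only ever sees the payload, so it never has to decide which
codec applies.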

Having the BOM there also breaks roundtrip:

	QString bom = u"\ufeff" "any string goes here";
	QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);

QString::toUtf8 does not, cannot and will never add the BOM. It would break 
concatenation.
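
A minimal sketch of the concatenation problem, again in plain C++ rather than
Qt. encodeWithBom and encodePlain are hypothetical stand-ins for an encoder
that prepends the BOM and one that does not; the input is assumed to be UTF-8
text already.

```cpp
#include <string>

// The three bytes of the UTF-8 encoded U+FEFF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical encoder that prepends a BOM to every output.
std::string encodeWithBom(const std::string &utf8Text)
{
    return kUtf8Bom + utf8Text;
}

// Hypothetical encoder that emits the text as-is, with no BOM.
std::string encodePlain(const std::string &utf8Text)
{
    return utf8Text;
}
```

With encodePlain, encoding "foo" and "bar" separately and joining the results
gives the same bytes as encoding "foobar". With encodeWithBom, the joined
result carries a second U+FEFF between "foo" and "bar", so it no longer
matches the encoding of "foobar".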

I know this is a behaviour change. But I repeat that it is an *intentional* 
change.

Anywhere other than the very start of the data, the U+FEFF character is
"ZERO WIDTH NO-BREAK SPACE" (ZWNBSP), so it's valid for it to appear there,
including as the next character in a file.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center