[Development] utf-8 BOM and parsers

Mon Apr 14 18:59:18 CEST 2014

On Monday 14 April 2014, at 18:29:26, Olivier Goffart wrote:
> What was the reason to change that behaviour?
> Personally, I think it's safer to keep the 5.2 behaviour and avoid breaking 
> user's code.

When I was rewriting the codec, it seemed wrong to me that ours behaved like 
that. When I wrote tst_utf8 way back when, I had to special-case the BOM [1] 
because QString::fromUtf8() would eat it, whereas QString::fromLocal8Bit() 
would or would not, depending on the OS.

That is, if you're on Mac OS X or on Blackberry, where we hardcode that UTF-8 
is the locale codec, QString::fromLocal8Bit(bomString) eats the BOM. On Linux 
and other Unix systems, where we don't hardcode it and fall back to either 
iconv or ICU, the BOM does not get eaten.
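
For code that needs the same result on every platform, one option is to 
remove the BOM from the raw bytes before handing them to either function. 
The following is only a minimal sketch: hasUtf8Bom() and stripUtf8Bom() are 
hypothetical helper names, not Qt API, and the sketch uses std::string 
instead of QByteArray so that it stands alone without Qt.

```cpp
#include <string>

// Hypothetical helper (not part of Qt): reports whether the byte string
// begins with the three-byte UTF-8 encoding of U+FEFF (EF BB BF).
bool hasUtf8Bom(const std::string &bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper (not part of Qt): returns a copy of the input with
// a leading UTF-8 BOM removed; input without a BOM is returned unchanged.
std::string stripUtf8Bom(const std::string &bytes)
{
    return hasUtf8Bom(bytes) ? bytes.substr(3) : bytes;
}
```

The stripped bytes could then be passed to QString::fromUtf8() or 
QString::fromLocal8Bit(), so the outcome no longer depends on whether the 
codec in use eats the BOM.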

Also, the Unix philosophy is that UTF-8 BOMs should not be used. The practice 
started on Windows, with tools like Notepad, where changing the system locale 
is not an option.

Besides, skipping this extra check at the beginning of the codec gains a 
little performance. That is, for the 99.99% of the strings that we call 
fromUtf8() on, it is faster.

[1] 
http://code.woboq.org/qt5/qtbase/tests/auto/corelib/codecs/utf8/tst_utf8.cpp.html#191
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center