[Development] utf-8 BOM and parsers
Thiago Macieira
thiago.macieira at intel.com
Mon Apr 14 18:59:18 CEST 2014
On Monday, 14 April 2014, at 18:29:26, Olivier Goffart wrote:
> What was the reason to change that behaviour?
> Personally, I think it's safer to keep the 5.2 behaviour and avoid breaking
> users' code.
When I was doing the rewrite, it seemed wrong that ours behaved like that.
When I wrote tst_utf8 way back when, I had to special-case the BOM [1] because
QString::fromUtf8() would eat it, but QString::fromLocal8Bit() would or would
not, depending on the OS.
That is, if you're on Mac OS X or on Blackberry, where we hardcode that UTF-8
is the locale codec, QString::fromLocal8Bit(bomString) eats the BOM. On Linux
and other Unix systems, where we don't hardcode it and fall back to either
iconv or ICU, the BOM does not get eaten.
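
For code that needs the same result on every platform, one option is to drop
the BOM explicitly before handing the bytes to any decoder. The following is
only a minimal sketch using the C++17 standard library, not Qt's
implementation; stripUtf8Bom() and checkStripUtf8Bom() are hypothetical names
made up for this illustration.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper, not part of Qt: returns the input without a leading
// UTF-8 BOM (bytes EF BB BF), or the input unchanged if there is none.
constexpr std::string_view stripUtf8Bom(std::string_view bytes) noexcept
{
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    if (bytes.size() >= bom.size() && bytes.substr(0, bom.size()) == bom)
        bytes.remove_prefix(bom.size());
    return bytes;
}

// Usage sketch: input with and without a BOM yields the same payload.
inline void checkStripUtf8Bom()
{
    assert(stripUtf8Bom("\xEF\xBB\xBFhello") == "hello");
    assert(stripUtf8Bom("hello") == "hello");
}
```

The returned view could then be passed to whichever decoder is in use, so the
outcome no longer depends on whether that decoder eats the BOM.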
Also, the Unix philosophy is that UTF-8 BOMs should not be used. The practice
started on Windows, with tools like Notepad, where changing the system locale
is not an option.
Besides, skipping this extra test at the beginning of the codec gains a little
performance. That is, for the 99.99% of the strings that we call fromUtf8()
on, it is faster.
[1]
http://code.woboq.org/qt5/qtbase/tests/auto/corelib/codecs/utf8/tst_utf8.cpp.html#191
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center