[Development] utf-8 BOM and parsers

Knoll Lars Lars.Knoll at digia.com
Tue Apr 22 14:35:33 CEST 2014


Hi,

Just came back from vacation today.

Unfortunately, BOMs at the beginning of files still seem to be used quite
a bit, especially in the Windows world. So I would actually vote for option 1
and rather keep compatibility. The reason is that stripping the BOM will not
break anything, but leaving it in will.
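
As a rough illustration of that kind of breakage, here is a minimal sketch
in plain C++17 (no Qt). The helpers stripUtf8Bom and firstKey are made up for
this example and are not existing Qt API; the "key=value" format is only an
assumption to have something to parse.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper, not a Qt API: returns the input without a leading
// UTF-8 BOM (bytes EF BB BF). Input without a BOM is returned unchanged.
inline std::string_view stripUtf8Bom(std::string_view data)
{
    const std::string_view bom = "\xEF\xBB\xBF";
    return data.substr(0, bom.size()) == bom ? data.substr(bom.size()) : data;
}

// Hypothetical naive parser: returns everything before the first '='.
inline std::string_view firstKey(std::string_view line)
{
    return line.substr(0, line.find('='));
}

void bomBreaksParserExample()
{
    const std::string_view file = "\xEF\xBB\xBF" "name=value";

    // With the BOM left in, the three BOM bytes are glued to the first key.
    assert(firstKey(file) != "name");

    // With the BOM stripped first, the parser sees what the author meant.
    assert(firstKey(stripUtf8Bom(file)) == "name");

    // Stripping is a no-op on input that never had a BOM.
    assert(stripUtf8Bom("name=value") == "name=value");
}
```

Only a BOM at offset 0 is removed here; the same three bytes later in the
data are left alone.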

We could also consider using our built-in UTF-8 decoder for all UTF-8
locales, so that we don’t use iconv or ICU if the locale is UTF-8 (and
thus always strip the BOM). That would at least give us consistent
cross-platform behaviour.
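
To make "a decoder that always strips the BOM" concrete, a small sketch in
plain C++17 follows. It is not Qt's actual decoder; decodeUtf8StripBom is a
hypothetical name, and replacing each bad byte with U+FFFD is just one
possible error policy chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not a Qt API: decodes UTF-8 into code points and
// drops a single U+FEFF at the very start of the input. Malformed bytes
// are replaced by U+FFFD, one replacement per offending byte.
std::u32string decodeUtf8StripBom(std::string_view in)
{
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const char32_t minForLen[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;          // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        ok = ok && cp >= minForLen[len] && cp <= 0x10FFFF
                && !(cp >= 0xD800 && cp <= 0xDFFF);   // no surrogates

        if (!ok) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        // Only a U+FEFF at offset 0 is treated as a BOM and skipped.
        if (!(cp == 0xFEFF && i == 0))
            out.push_back(cp);
        i += len;
    }
    return out;
}

void decoderExample()
{
    // Same result with or without a leading BOM, on every platform.
    assert(decodeUtf8StripBom("\xEF\xBB\xBF" "hi") == U"hi");
    assert(decodeUtf8StripBom("hi") == U"hi");
    // U+FEFF elsewhere is kept as an ordinary character.
    assert(decodeUtf8StripBom("a\xEF\xBB\xBF" "b") == U"a\uFEFFb");
}
```

Because the stripping happens inside the decoder, callers would not depend on
whether iconv, ICU or anything else would have handled the BOM differently.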

Cheers,
Lars

On 16/04/14 17:03, "Thiago Macieira" <thiago.macieira at intel.com> wrote:

>On Mon 14 Apr 2014, at 10:33:48, Thiago Macieira wrote:
>> On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
>> > Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
>> > started on Windows, with tools like Notepad, where changing the system
>> > locale is not an option.
>> 
>> To be clear: BOMs are to be used to determine that the content *is* UTF-8.
>> Once you know that it is UTF-8, you can strip the BOM and pass the rest to
>> the decoder. Passing the BOM to the decoder sounds wrong because you'd be
>> expecting it to choose the codec when decoding. That's what Notepad does:
>> if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
>> 
>> Having the BOM there also breaks roundtrip:
>> 
>> 	QString bom = u"\ufeff" "any string goes here";
>> 	QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
>> 
>> QString::toUtf8 does not, cannot and will never add the BOM. It would
>> break concatenation.
>> 
>> I know this is a behaviour change. But I repeat that it is an
>> *intentional* change.
>> 
>> The U+FEFF character is called "zero-width non-breaking space" (ZWNBSP)
>> anywhere else, so it's valid for it to appear there, including as the
>> next character in a file.
>
>Lars, can you make a call?
>
>Options are:
>1) revert to old behaviour, change the content creators to never add a BOM
>
>2) same as above, but fix the parsers now and change the behaviour in
>QString in Qt 5.4 or 5.5
>
>3) keep the new behaviour, document it in the changelog, change the
>content creators as above, and fix the parsers
>
>-- 
>Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel Open Source Technology Center
>


