[Development] utf-8 BOM and parsers

Tue Apr 15 09:09:02 CEST 2014

> -----Original Message-----
> From: development-bounces+kai.koehne=digia.com at qt-project.org
> [mailto:development-bounces+kai.koehne=digia.com at qt-project.org] On
> Behalf Of Thiago Macieira
> Sent: Monday, April 14, 2014 7:34 PM
> To: development at qt-project.org
> Subject: Re: [Development] utf-8 BOM and parsers

Hi Thiago,

Thanks for listing the reasons here in detail!

> On Mon, 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
> > Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
> > started on Windows, with tools like Notepad, where changing the
> > system locale is not an option.

It's mostly an issue with (files edited on) Windows, indeed. 

> To be clear: BOMs are to be used to determine that the content *is* UTF-8.
> Once you know that it is UTF-8, you can strip it and pass to the decoder.
> Passing the BOM to the decoder sounds wrong because you'd be expecting
> it to choose the codec when decoding. That's what Notepad does: if there's a
> BOM, it decodes as UTF-8; otherwise it decodes as ANSI.

Right. But the issue is that the 'easiest' way to get a file into a QString so far is:

QFile file;
// ...
QString::fromUtf8(file.readAll());

By the way, we're using that pattern in both Qt and Qt Creator, too. It now breaks in ways that can be pretty subtle, given that only files starting with a BOM are affected and that the BOM usually isn't displayed.
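
To illustrate what callers of that pattern would have to do themselves, here is a minimal sketch of the "check for a BOM, strip it, then decode" step. It is written in plain standard C++ rather than Qt so that it is self-contained, and stripUtf8Bom is a made-up name, not an existing Qt function. It assumes the whole file content is already in memory as raw bytes.

```cpp
#include <string>

// Hypothetical helper (not a Qt API): returns the raw bytes without a
// leading UTF-8 BOM (EF BB BF). Only a BOM at the very start is removed;
// a U+FEFF anywhere else is left alone, since it may be a real character.
std::string stripUtf8Bom(const std::string &bytes)
{
    static const std::string bom = "\xEF\xBB\xBF";
    // compare() is safe for inputs shorter than three bytes.
    if (bytes.compare(0, bom.size(), bom) == 0)
        return bytes.substr(bom.size());
    return bytes;
}
```

With Qt types the same check could presumably be done on the QByteArray returned by readAll() before it is handed to QString::fromUtf8(), but every existing caller would need that extra step.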

> Having the BOM there also breaks roundtrip:
> 
> 	QString bom = u"\ufeff" "any string goes here";
> 	QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
> 
> QString::toUtf8 does not, cannot and will never add the BOM. It would break
> concatenation.

So you'd have to add a BOM explicitly to the file before writing, if you really want it.
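
Roughly like this, as a sketch in plain standard C++ (withUtf8Bom is a hypothetical name, and the input is assumed to be already-encoded UTF-8 bytes for the complete file):

```cpp
#include <string>

// Hypothetical helper (not a Qt API): prepends the UTF-8 BOM (EF BB BF)
// exactly once, to the complete file content. Doing this per string
// instead would leave stray BOMs in the middle after concatenation.
std::string withUtf8Bom(const std::string &utf8)
{
    return std::string("\xEF\xBB\xBF") + utf8;
}
```

The point is that the BOM belongs to the file, not to each encoded string, so it is the code writing the file that has to add it.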

> I know this is a behaviour change. But I repeat that it is an *intentional*
> change.
>
> The U+FEFF character is called "zero-width non-breaking space" (ZWNBSP)
> anywhere else, so it's valid to appear there. Including the next character in a
> file.

Right, though I understood this is deprecated since Unicode 3.2 (released in 2002).

All in all, I see a lot of code breaking with this change ... Given that, I'd like to give a +1 from my side for reverting to the previous behavior for 5.3.

My 2 cents

Kai