[Development] utf-8 BOM and parsers
Koehne Kai
Kai.Koehne at digia.com
Tue Apr 15 09:09:02 CEST 2014
> -----Original Message-----
> From: development-bounces+kai.koehne=digia.com at qt-project.org
> [mailto:development-bounces+kai.koehne=digia.com at qt-project.org] On
> Behalf Of Thiago Macieira
> Sent: Monday, April 14, 2014 7:34 PM
> To: development at qt-project.org
> Subject: Re: [Development] utf-8 BOM and parsers
Hi Thiago,
Thanks for listing the reasons here in detail!
> Em seg 14 abr 2014, às 09:59:18, Thiago Macieira escreveu:
> > Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
> > started on Windows, with tools like Notepad, where changing the
> > system locale is not an option.
It's mostly an issue with (files edited on) Windows, indeed.
> To be clear: BOMs are to be used to determine that the content *is* UTF-8.
> Once you know that it is UTF-8, you can strip it and pass to the decoder.
> Passing the BOM to the decoder sounds wrong because you'd be expecting
> it to choose the codec when decoding. That's what Notepad does: if there's a
> BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
Right. But the issue is that the 'easiest' way to get a file into a QString so far is:
    QFile file;
    // ...
    QString::fromUtf8(file.readAll());
We're using that pattern in both Qt and Qt Creator, too, btw. It now breaks in ways that can be pretty subtle, given that it only affects files starting with a BOM, and that the BOM usually isn't displayed.
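To make the "strip it, then decode" step concrete, here is a minimal sketch in plain standard C++ (no Qt, so it stands alone). The helper names hasUtf8Bom and stripUtf8Bom are made up for illustration and are not Qt API; the only fact relied on is that U+FEFF encodes in UTF-8 as the bytes EF BB BF.

```cpp
#include <cassert>
#include <string>

// The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical helper: true if the raw bytes start with a UTF-8 BOM.
// Inputs shorter than three bytes never match.
bool hasUtf8Bom(const std::string &bytes)
{
    return bytes.compare(0, 3, kUtf8Bom) == 0;
}

// Hypothetical helper: drops one leading BOM, if present, and returns
// everything else unchanged. A U+FEFF later in the data is left alone.
std::string stripUtf8Bom(const std::string &bytes)
{
    return hasUtf8Bom(bytes) ? bytes.substr(3) : bytes;
}
```

In Qt code the same check would presumably be applied to the QByteArray returned by file.readAll(), before handing it to QString::fromUtf8().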
> Having the BOM there also breaks roundtrip:
>
> QString bom = u"\ufeff" "any string goes here";
> QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
>
> QString::toUtf8 does not, cannot and will never add the BOM. It would break
> concatenation.
So you'd have to add a BOM explicitly to the file before writing, if you really want it.
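As a rough sketch of what "add a BOM explicitly" could look like on the byte level, again in plain standard C++ rather than Qt. The helper name withUtf8Bom is hypothetical; the caller would write the returned bytes to the file as-is.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the already-encoded UTF-8 payload with a
// BOM (bytes EF BB BF) in front, unless the payload already starts with one.
std::string withUtf8Bom(const std::string &utf8)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (utf8.compare(0, 3, bom) == 0)
        return utf8;               // already marked, do not add a second BOM
    return std::string(bom) + utf8;
}
```

Doing this once, at the point where the file is written, keeps the BOM out of the strings themselves, so concatenating encoded strings stays safe.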
> I know this is a behaviour change. But I repeat that it is an *intentional*
> change.
>
> The U+FEFF character is called "zero-width non-breaking space" (ZWNBSP)
> anywhere else, so it's valid to appear there. Including the next character in a
> file.
Right, though as I understand it, this use has been deprecated since Unicode 3.2 (released in 2002).
All in all, I see a lot of code breaking with this change ... Given that, +1 from my side for reverting the behavior change for 5.3.
My 2 cents
Kai