[Development] utf-8 BOM and parsers

Tue Apr 15 13:13:53 CEST 2014

On Tuesday 15 April 2014, Koehne Kai wrote:
> > -----Original Message-----
> > From: development-bounces+kai.koehne=digia.com at qt-project.org
> > [mailto:development-bounces+kai.koehne=digia.com at qt-project.org] On
> > Behalf Of Thiago Macieira
> > Sent: Monday, April 14, 2014 7:34 PM
> > To: development at qt-project.org
> > Subject: Re: [Development] utf-8 BOM and parsers
> 
> Hi Thiago,
> 
> Thanks for listing the reasons here in detail!
> 
> > On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
> > > Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
> > > started on Windows, with tools like Notepad, where changing the
> > > system locale is not an option.
> 
> It's mostly an issue with (files edited on) Windows, indeed.
> 
> > To be clear: BOMs are to be used to determine that the content *is*
> > UTF-8. Once you know that it is UTF-8, you can strip it and pass to the
> > decoder. Passing the BOM to the decoder sounds wrong because you'd be
> > expecting it to choose the codec when decoding. That's what Notepad does:
> > if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
> 
> Right. But the issue is that the 'easiest' way to get a file into a QString
> so far is
> 
> QFile file;
> // ...
> QString::fromUtf8(file.readAll());
> 
> We're using that pattern btw in both Qt and Qt Creator, too. This breaks
> now in ways that can be pretty subtle (given that it only affects files
> starting with a BOM, and that the BOM isn't displayed usually).
> 
> > Having the BOM there also breaks roundtrip:
> > 	QString bom = u"\ufeff" "any string goes here";
> > 	QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
> > 
> > QString::toUtf8 does not, cannot and will never add the BOM. It would
> > break concatenation.
> 
> So you'd have to add a BOM explicitly to the file before writing, if you
> really want it.
> 
The BOM has no official meaning or function in UTF-8 other than as a
zero-width non-breaking space. It was only meant as a byte-order marker in
16- and 32-bit Unicode encodings. If you add it to Unix files, it breaks other
magic markers at the beginning of the file, such as the #! line of a script.
The UTF-8 BOM is a Windows-specific, non-standard hack that is recommended
against. So yes, anyone who wants it needs to add it themselves, as it becomes
part of the text content on any other platform.
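
For illustration only, here is a rough sketch of what "strip it before
decoding" and "add it yourself before writing" could look like at the byte
level. It is plain C++17 rather than Qt, and the names kUtf8Bom, stripUtf8Bom
and withUtf8Bom are made up for this example; they are not Qt API.

```cpp
#include <string>
#include <string_view>

// U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
constexpr std::string_view kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper: drop a single leading UTF-8 BOM, if there is one,
// so that the decoder only ever sees the actual text.
std::string_view stripUtf8Bom(std::string_view bytes) {
    if (bytes.substr(0, kUtf8Bom.size()) == kUtf8Bom)
        bytes.remove_prefix(kUtf8Bom.size());
    return bytes;
}

// Hypothetical helper: a writer that really wants the BOM prepends it
// explicitly, once, at the very start of the file contents.
std::string withUtf8Bom(std::string_view bytes) {
    std::string out(kUtf8Bom);
    out.append(bytes);
    return out;
}
```

In the Qt pattern quoted above, the same idea would presumably mean checking
the QByteArray from file.readAll() for those three bytes before handing it to
QString::fromUtf8().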

`Allan