[Development] utf-8 BOM and parsers

Mon Apr 14 14:26:19 CEST 2014

Hi,

We have various parsers in Qt that parse "source code" and do things with it, 
such as the QML parser, the CSS parser and others. We do make the assumption 
that their input is UTF-8 encoded and therefore have simply used

    QString code = QString::fromUtf8(byteArray);

in some form or other, and then passed the "code" variable to the lexer. The 
lexers often check for "white space" using QChar::isSpace() and act 
accordingly.

If the input file started with a byte order mark, previous versions of 
QString::fromUtf8() removed that mark, so the parsers never saw it.

In Qt 5.3 the behavior was changed and the byte order mark (U+FEFF) is now 
present in the resulting QString, which causes issues in parsers that do not 
expect that mark to appear. (This has been reported by early testers of Qt 5.3 
in various places in Jira.)
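
To illustrate the failure mode, here is a minimal sketch in plain C++ rather 
than Qt. It assumes std::u16string as a stand-in for QString and a deliberately 
simplified predicate in place of QChar::isSpace(); the names isSpaceLike and 
skipWhiteSpace are hypothetical and do not exist in any Qt lexer.

```cpp
#include <cstddef>
#include <string>

// Simplified stand-in for QChar::isSpace(): only a few common white-space
// code points. U+FEFF is deliberately absent, because Unicode classifies it
// as a format character (Cf), not as white space.
bool isSpaceLike(char16_t c)
{
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r'
        || c == u'\u00A0';
}

// Hypothetical lexer step: skip leading white space and return the offset
// at which the first token starts.
std::size_t skipWhiteSpace(const std::u16string &code)
{
    std::size_t pos = 0;
    while (pos < code.size() && isSpaceLike(code[pos]))
        ++pos;
    return pos;
}
```

With input that begins with U+FEFF, skipWhiteSpace() stops at offset 0, so the 
first "token" the lexer sees starts with the mark instead of the first real 
character of the source.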

Since this affects not just one place but many (for example, we have many 
copies of the QML lexer around), I'd like to determine what the _correct_ fix 
for this issue is, because frankly speaking I don't know :). However, I have an 
interest in the same fix being applied to qtbase, qtdeclarative, qtscript, 
qtcreator and other affected modules.

So I have some questions:

1) Should the character be treated as a white-space character? (one that 
doesn't consume any column in the line/column reporting later) If yes, what is 
the right way to fix the parsers? 
  1.1) Should any char.isSpace() condition be extended to check for such 
markers?
  1.2) Or should isSpace() be changed?

2) Alternatively, do we need a function somewhere else in Qt that removes a 
leading byte order mark from the QString and we change all parsers in Qt to 
use that function?

3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. 
Is that intentional?
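
To make options 1.1 and 2 concrete, here is a rough sketch of both, again in 
plain C++ with std::u16string standing in for QString. The helper names 
(withoutLeadingBom, isSpaceOrBom, isSpaceSimplified) are made up for this mail 
and are not existing Qt API; the white-space check is a simplified placeholder 
for QChar::isSpace().

```cpp
#include <string>
#include <utility>

// U+FEFF, the code point a UTF-8 byte order mark (EF BB BF) decodes to.
constexpr char16_t ByteOrderMark = u'\uFEFF';

// Option 2: remove a single leading byte order mark, if present, before the
// string is handed to the lexer. Marks elsewhere in the text are left alone.
std::u16string withoutLeadingBom(std::u16string code)
{
    if (!code.empty() && code.front() == ByteOrderMark)
        code.erase(0, 1);
    return code;
}

// Simplified placeholder for QChar::isSpace() (assumption: ASCII only).
bool isSpaceSimplified(char16_t c)
{
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// Option 1.1: extend the lexer's own white-space condition so that the mark
// is skipped like white space. Column accounting is not handled here.
bool isSpaceOrBom(char16_t c)
{
    return c == ByteOrderMark || isSpaceSimplified(c);
}
```

Option 2 touches each parser only at its entry point, whereas option 1.1 has to 
be applied to every isSpace() check in every lexer copy. On a real QString I 
would expect the option 2 check to be something along the lines of 
code.startsWith(QChar(QChar::ByteOrderMark)) followed by code.remove(0, 1), but 
I have not tested that against the parsers.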

Simon