[Development] utf-8 BOM and parsers
Simon Hausmann
simon.hausmann at digia.com
Mon Apr 14 14:26:19 CEST 2014
Hi,
We have various parsers in Qt that parse "source code" and do things with it,
such as the QML parser, the CSS parser and others. We assume that their
input is UTF-8 encoded and have therefore simply used
    QString code = QString::fromUtf8(byteArray);
in some form or other, and then passed the "code" variable to the lexer. The
lexers often check for "white space" using QChar::isSpace() and act
accordingly.
When the input file started with a byte order mark (BOM), previous versions of
QString::fromUtf8() removed that mark, so the parsers never saw it.
In Qt 5.3 the behavior was changed and the byte order mark is present in the
resulting QString, which causes issues in parsers that do not expect that mark
to appear. (This has been reported by early testers of Qt 5.3 in various
places in Jira.)
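To illustrate the failure mode, here is a minimal sketch in plain C++ without Qt. It assumes that std::u16string stands in for QString and that a hypothetical ASCII-only isLexerSpace() stands in for the lexer's QChar::isSpace() check. The names isLexerSpace and firstTokenOffset are made up for this example and are not taken from any Qt lexer. It only shows that a leading U+FEFF, which is a format character rather than a space character in Unicode, is not skipped as white space and therefore becomes the first thing the lexer tries to tokenize.

```cpp
#include <cstddef>
#include <string>

// U+FEFF is the code point that a UTF-8 byte order mark (bytes EF BB BF)
// decodes to.
constexpr char16_t kByteOrderMark = 0xFEFF;

// Hypothetical stand-in for a lexer's white-space test. Only ASCII white
// space is handled, to keep the sketch short.
bool isLexerSpace(char16_t c)
{
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// Returns the index of the first character that is not skipped as white
// space, i.e. where the lexer would start reading its first token.
std::size_t firstTokenOffset(const std::u16string &code)
{
    std::size_t i = 0;
    while (i < code.size() && isLexerSpace(code[i]))
        ++i;
    return i;
}
```

For input that begins with the mark, firstTokenOffset() returns 0 and the character at that position is U+FEFF itself, which a lexer that expects an identifier or keyword there would presumably reject.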
Since this affects not just one place but many (and for example we have many
copies of the QML lexer around), I'd like to determine what the _correct_ fix
for this issue is, because frankly speaking I don't know :). However, I have an
interest in the same fix being applied to qtbase, qtdeclarative, qtscript,
qtcreator and other affected modules.
So I have some questions:
1) Should the character be treated as a white-space character? (one that
doesn't consume any column in the line/column reporting later) If yes, what is
the right way to fix the parsers?
1.1) Should any char.isSpace() condition be extended to check for such
markers?
1.2) Or should isSpace() be changed?
2) Alternatively, do we need a function somewhere else in Qt that removes a
leading byte order mark from the QString and we change all parsers in Qt to
use that function?
3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect.
Is that intentional?
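Options 1.1 and 2 could look roughly like the following sketch. It is again plain C++ with std::u16string standing in for QString. The helper names withoutLeadingBom and isSpaceOrBom are hypothetical and do not exist in Qt, and the ASCII-only white-space test is a simplification of what QChar::isSpace() covers.

```cpp
#include <string>

// U+FEFF, the code point a UTF-8 byte order mark decodes to.
constexpr char16_t kBom = 0xFEFF;

// Option 2 (sketch): drop a single leading U+FEFF before the text reaches
// the lexer. Marks elsewhere in the text are left untouched.
std::u16string withoutLeadingBom(std::u16string code)
{
    if (!code.empty() && code.front() == kBom)
        code.erase(0, 1);
    return code;
}

// Option 1.1 (sketch): widen the lexer's white-space test so that the
// mark is skipped wherever the lexer already skips white space.
bool isSpaceOrBom(char16_t c)
{
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r' || c == kBom;
}
```

One difference between the two: with the stripping helper the mark never reaches the lexer, so line/column bookkeeping needs no change, whereas the widened predicate accepts U+FEFF anywhere in the input and the lexer would still have to avoid counting it as a column.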
Simon