[Development] [Question] Implementation of XML character validation

Kurt Pattyn pattyn.kurt at gmail.com
Sun Sep 8 13:42:24 CEST 2013


All XML validation in Qt is based on XML 1.0 (and not the newer 1.1 standard).
I found at least 3 places where validity is checked:

1. in qxmlstream.cpp:

Method resolveCharRef:

	// checks for validity
	ok &= (s == 0x9 || s == 0xa || s == 0xd || (s >= 0x20 && s <= 0xd7ff)
	       || (s >= 0xe000 && s <= 0xfffd) || (s >= 0x10000 && s <= QChar::LastValidCodePoint));


Method scanUntil:

	//checks for invalidity
	if (c < 0x20 || (c > 0xFFFD && c < 0x10000) || c > QChar::LastValidCodePoint )
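To compare the two expressions quoted above side by side, here is a minimal standalone sketch. The function names and kLastValidCodePoint are made up for illustration (the constant stands in for QChar::LastValidCodePoint, assumed to be 0x10FFFF). It only looks at the expressions in isolation; the surrounding code in qxmlstream.cpp may well handle tab, LF, CR and surrogates separately before these checks are reached.

```cpp
// Assumption: stand-in for QChar::LastValidCodePoint.
constexpr char32_t kLastValidCodePoint = 0x10FFFF;

// The validity expression quoted from resolveCharRef.
constexpr bool validPerResolveCharRef(char32_t s)
{
    return s == 0x9 || s == 0xa || s == 0xd || (s >= 0x20 && s <= 0xd7ff)
        || (s >= 0xe000 && s <= 0xfffd)
        || (s >= 0x10000 && s <= kLastValidCodePoint);
}

// The invalidity expression quoted from scanUntil.
constexpr bool invalidPerScanUntil(char32_t c)
{
    return c < 0x20 || (c > 0xFFFD && c < 0x10000) || c > kLastValidCodePoint;
}

// True when the two quoted expressions give a consistent verdict,
// i.e. exactly one of "valid" and "invalid" holds for the code point.
constexpr bool checksAgree(char32_t c)
{
    return validPerResolveCharRef(c) != invalidPerScanUntil(c);
}
```

Taken on their own, the two expressions agree for ordinary characters but not for 0x9, 0xA, 0xD (valid in the first, "invalid" in the second) or for the surrogate range 0xD800 - 0xDFFF (rejected by the first, not flagged by the second).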


2. In qxmlutils.cpp:

bool QXmlUtils::isChar(const QChar c)
{
    return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
           || c.unicode() == 0x0009
           || c.unicode() == 0x000A
           || c.unicode() == 0x000D
           || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
}

This is essentially the same as the checks above, except that it does not accept characters in the range 0x10000 - 0x10FFFF.
I think this is a bug, especially because the source refers to the standard at http://www.w3.org/TR/REC-xml/#NT-Char, which says:

[2]   Char	       ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
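For reference, production [2] can be written as a single predicate over full Unicode code points. The sketch below is plain C++ and not Qt code; isXml10Char and fromSurrogates are hypothetical names. Since a QChar holds a single 16-bit UTF-16 code unit, a per-QChar check can only see the range 0x10000 - 0x10FFFF as surrogate pairs, so I assume a caller would combine the pair first.

```cpp
// Hypothetical helper (not Qt API): XML 1.0 production [2] Char,
// evaluated on a full code point rather than a 16-bit code unit.
constexpr bool isXml10Char(char32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: combine a UTF-16 surrogate pair into a code point.
// The caller must ensure high is in 0xD800-0xDBFF and low in 0xDC00-0xDFFF.
constexpr char32_t fromSurrogates(char16_t high, char16_t low)
{
    return 0x10000
        + ((static_cast<char32_t>(high) - 0xD800) << 10)
        + (static_cast<char32_t>(low) - 0xDC00);
}
```

With this, a lone surrogate such as 0xD800 is rejected, while a well-formed pair is accepted once combined, e.g. isXml10Char(fromSurrogates(0xD83D, 0xDE00)).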


------------------------------------
Now, I have three questions:

1. Can someone confirm if the check in QXmlUtils is actually a bug?
2. Wouldn't it be better to move these checks to QChar, so that at least there is only one implementation?
3. Is there a reason to stick to XML 1.0, or should Qt also implement the XML 1.1 standard?
According to the XML 1.1 standard (http://www.w3.org/TR/xml11/#charsets), allowed characters are:

[2]   Char             ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]     /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]  RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

So the allowed character range is slightly extended: it now includes all characters from 0x0001 to 0x001F, of which XML 1.0 only allowed 0x9, 0xA and 0xD. In addition, XML 1.1 has defined some characters to be highly discouraged, but still valid.
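The two XML 1.1 productions above translate directly into predicates. Again this is only a sketch in plain C++ with made-up names (isXml11Char, isXml11RestrictedChar), not a proposal for the actual Qt API.

```cpp
// Hypothetical helper: XML 1.1 production [2] Char.
constexpr bool isXml11Char(char32_t cp)
{
    return (cp >= 0x1 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: XML 1.1 production [2a] RestrictedChar.
constexpr bool isXml11RestrictedChar(char32_t cp)
{
    return (cp >= 0x1 && cp <= 0x8)
        || (cp >= 0xB && cp <= 0xC)
        || (cp >= 0xE && cp <= 0x1F)
        || (cp >= 0x7F && cp <= 0x84)
        || (cp >= 0x86 && cp <= 0x9F);
}
```

For example, 0x1 satisfies both predicates, while 0x9, 0xA, 0xD and 0x85 are Char but not RestrictedChar, and 0x0 is still not a Char at all.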

