[Development] [Question] Implementation of XML character validation

Konstantin Ritt ritt.ks at gmail.com
Sun Sep 8 20:00:45 CEST 2013


[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */
in XML 1.0 is much the same as
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any
Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] |
[#x86-#x9F]
in XML 1.1, except that XML 1.1 additionally accepts [U+0001..U+001F], and
that everything in RestrictedChar, including [U+007F..U+0084] and
[U+0086..U+009F], may now appear only as a character reference.
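
To make the comparison concrete, here is a minimal sketch of the three
productions as predicates over full Unicode code points. The function names
are hypothetical and are not part of Qt. Taking a char32_t rather than a
QChar is a deliberate choice: a QChar holds a single UTF-16 code unit, so it
cannot represent anything in [#x10000-#x10FFFF].

```cpp
// Hypothetical helpers (not Qt API), transcribed from the productions above.

// XML 1.0, production [2] Char.
constexpr bool isXml10Char(char32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// XML 1.1, production [2] Char.
constexpr bool isXml11Char(char32_t c)
{
    return (c >= 0x1 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// XML 1.1, production [2a] RestrictedChar.
constexpr bool isXml11RestrictedChar(char32_t c)
{
    return (c >= 0x1 && c <= 0x8)
        || (c >= 0xB && c <= 0xC)
        || (c >= 0xE && c <= 0x1F)
        || (c >= 0x7F && c <= 0x84)
        || (c >= 0x86 && c <= 0x9F);
}

// U+007F is a plain Char in XML 1.0 but a RestrictedChar in XML 1.1.
static_assert(isXml10Char(0x7F) && isXml11RestrictedChar(0x7F), "U+007F");
```

With a single set of predicates like this, the checks quoted below could all
share one implementation.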

The code looks correct for XML 1.0; however, I couldn't find any surrogate
validation code in either qxml* or the QUtfCodecs. I'll probably write some
additional tests once I have time for that.
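
For reference, the kind of surrogate check I was looking for could be
sketched roughly as follows. This is only an illustration on a plain
std::u16string with a hypothetical function name; inside Qt one would
presumably use QChar::isHighSurrogate() and QChar::isLowSurrogate() rather
than the raw ranges.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Qt API): returns true if every surrogate code
// unit in `s` belongs to a well-formed high/low pair.
inline bool hasWellFormedSurrogates(const std::u16string &s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        const bool isLow = u >= 0xDC00 && u <= 0xDFFF;
        if (isLow)
            return false;      // low surrogate without a preceding high one
        if (isHigh) {
            if (i + 1 == s.size())
                return false;  // high surrogate at the end of the input
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;  // high surrogate not followed by a low one
            ++i;               // skip the low half of the valid pair
        }
    }
    return true;
}
```

A lone or reversed surrogate makes the function return false, which is the
case the Char production excludes via "excluding the surrogate blocks".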

You may want to file a suggestion/feature request at
http://bugreports.qt-project.org/ about upgrading Qt's XML support to
XML 1.1.


Regards,
Konstantin


2013/9/8 Kurt Pattyn <pattyn.kurt at gmail.com>

> All XML validation in Qt is based on XML 1.0 (and not the newer 1.1
> standard).
> I found at least 3 places where validity is checked:
>
> 1. In qxmlstream.cpp:
>
> Method resolveCharRef:
>
> 	// checks for validity
> 	ok &= (s == 0x9 || s == 0xa || s == 0xd || (s >= 0x20 && s <= 0xd7ff)
> 	       || (s >= 0xe000 && s <= 0xfffd) || (s >= 0x10000 && s <= QChar::LastValidCodePoint));
>
> Method scanUntil:
>
> 	// checks for invalidity
> 	if (c < 0x20 || (c > 0xFFFD && c < 0x10000) || c > QChar::LastValidCodePoint )
>
> 2. In qxmlutils.cpp:
>
> bool QXmlUtils::isChar(const QChar c)
> {
>     return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
>            || c.unicode() == 0x0009
>            || c.unicode() == 0x000A
>            || c.unicode() == 0x000D
>            || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
> }
>
> It is pretty much the same as the above checks, except that it doesn't
> check for characters in the range 0x10000 - 0x10FFFF.
> I think this is a bug, especially because the source refers to the
> standard at http://www.w3.org/TR/REC-xml/#NT-Char, which says:
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | *[#x10000-#x10FFFF]* /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
>
> ------------------------------------
> Now, I have three questions:
>
> 1. Can someone confirm if the check in QXmlUtils is actually a bug?
> 2. Wouldn't it be better to move these checks to QChar, so that at least
> there is only one implementation?
> 3. Is there a reason to stick to XML 1.0, or should Qt also implement the
> XML 1.1 standard?
> According to the XML 1.1 standard (http://www.w3.org/TR/xml11/#charsets),
> allowed characters are:
>
> [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
>
> [2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
>
> So the allowed character range is slightly extended (it now includes all
> characters from 0x0001 to 0x001F). In addition, XML 1.1 defines some
> characters as highly discouraged, but still valid.