[Development] QTBUG-23489: Implement the new regular expression classes using PCRE

Sat Feb 25 23:25:16 CET 2012

2012/2/25 Thiago Macieira <thiago at kde.org>:
> On Saturday, 25 February 2012 02.48.11, Giuseppe D'Angelo wrote:
>> 1) add the operator QRegularExpression() const to QRegExp and remove
>> the QString overloads taking a QRegExp:
>> Drawbacks:
>> * possible subtle bugs introduced by the different QRegExp and
>> QRegularExpression behaviours/quirks/syntax support/etc.
>
> This would be my preferred way, but if a slight behaviour change exists, then
> these two operations would produce different results:
>
>        int idx = string.indexOf(rx);
>        int idx = rx.indexIn(string);

I can't come up with an example using indexIn, but here is one with replace:

  QString str("aaa");
  str.replace(QRegExp("a*(a)*"), "<\\1>");
  // str is now "<a><>", instead of the expected "<><>"

This is because of the default pattern syntax (RegExp), which exhibits
this totally odd behaviour (why did it capture only ONE a? It's even
documented that it should have captured all three...). Both RegExp2 and
QRegularExpression do "the right thing". By the way, this is also why
the docs say that QRegExp will switch to RegExp2 as its default syntax in
Qt 5 (something that has not been done yet, actually).
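
As a rough illustration of the "expected" result, the sketch below runs the
same pattern through std::regex. That is purely a stand-in for a greedy,
Perl-like engine: it is neither QRegExp nor QRegularExpression, and
replaceAllGreedy is a made-up name for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for a greedy, Perl-like engine. This uses std::regex (ECMAScript
// syntax), NOT QRegExp or QRegularExpression; it only illustrates the semantics.
std::string replaceAllGreedy(const std::string &input)
{
    // Same pattern as in the example above: a* is greedy, so (a)* is left
    // with nothing to match and capture group 1 stays empty.
    static const std::regex rx("a*(a)*");

    // "$1" is the std::regex spelling of the "\\1" back-reference used by
    // QString::replace.
    return std::regex_replace(input, rx, "<$1>");
}
```

With a greedy engine the first match consumes all three characters and leaves
group 1 empty; a second, empty match is then found at the end of the string.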

This is more or less my #1 concern with the conversion, because it's
about the engine itself and not the patterns.

From a pure syntax point of view:

- RegExp and RegExp2: they seem to follow the general rule that an
escaped character that doesn't have any special meaning (for QRegExp)
stands for the character itself. That means that, for instance, the
pattern /\z/ matches a literal "z" in QRegExp, whereas inside a
QRegularExpression it matches the very end of the string. So the pattern
needs to be analyzed and these escapes fixed.

- Wildcard and WildcardUnix: QRegExp already converts them internally
into Perl-like regular expressions (cf. qt_regexp_toCanonical and wc2rx
in qregexp.cpp), so it should just be a matter of doing the same.

- FixedString: simply needs to be escaped through QRegularExpression::escape.

- W3CXmlSchema11: I don't know yet. From a quick glance over the spec:
* new sequences like \c, \i, \p and their uppercase variants are added;
* it supports subtraction inside character classes, for instance
[a-z-[aeiou]]. Subtractions can be nested.

- Case sensitive: set the right option

- Minimal matching: should be doable by setting the inverted greediness option
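
To make the RegExp/RegExp2 and Wildcard points above a bit more concrete, here
is a minimal sketch of what such a converter might look like. It uses plain
std::string so that it stands alone, and every name in it is hypothetical. The
list of escapes that QRegExp treats as special is an assumption that would
need checking against qregexp.cpp. The wildcard translation is a simplified
glob conversion (no [...] classes, ASCII only), not a copy of wc2rx.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// ASSUMPTION: these are the escaped letters/digits that have a special
// meaning in QRegExp (\a \f \n \r \t \v \x \0 \d \D \s \S \w \W \b \B \1-\9).
// Verify against qregexp.cpp before relying on this list.
static bool isKnownQRegExpEscape(char c)
{
    static const std::string known = "afnrtvx0dDsSwWbB123456789";
    return known.find(c) != std::string::npos;
}

// Hypothetical helper: drop the backslash in front of letters that QRegExp
// would treat as plain literals, so that e.g. \z does not become an anchor.
std::string fixLiteralEscapes(const std::string &pattern)
{
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c != '\\' || i + 1 == pattern.size()) {
            out += c;                       // ordinary char or trailing backslash
            continue;
        }
        const char next = pattern[++i];
        const bool alnum = std::isalnum(static_cast<unsigned char>(next)) != 0;
        if (alnum && !isKnownQRegExpEscape(next)) {
            out += next;                    // \z -> z
        } else {
            out += '\\';                    // keep \d, \., \\ and friends
            out += next;
        }
    }
    return out;
}

// Hypothetical, simplified wildcard translation: '*' -> ".*", '?' -> ".",
// letters, digits and '_' are copied, everything else is backslash-escaped.
// Character classes ([...]) and non-ASCII input are deliberately not handled.
std::string wildcardToRegex(const std::string &wildcard)
{
    std::string out;
    for (const char c : wildcard) {
        if (c == '*')
            out += ".*";
        else if (c == '?')
            out += '.';
        else if (std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_')
            out += c;
        else {
            out += '\\';
            out += c;
        }
    }
    return out;
}
```

For example, fixLiteralEscapes applied to the pattern \d+\.\z keeps \d and \.
but turns \z into a plain z, and wildcardToRegex applied to *.cpp produces
.*\.cpp. Differences such as \xhhhh versus PCRE's \x{hhhh} are left out here.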

> The whole reason why we kept QRegExp in the first place was so that we wouldn't
> introduce those subtle issues (and for qmake). So, aside from qmake, if  we
> managed to convert the QRegExp expressions to PCRE, we could also get rid of
> the QRegExp engine altogether.

I'll try to write a converter and see the results.

>
> In which case we could even keep QRegExp as a deprecated class for these
> methods:
>
>> * decide what to do with the overloads taking a non-const-ref QRegExp
>> (used to extract captures, matched length, etc.)

What about this?

>> what it *exactly* does... who's the QRegExp maintainer?)
>
> I guess that's now you.

Uh oh :-)

>> Comments?
>
> How difficult is it to convert the QRegExp expressions to PCRE?

I tried to elaborate on this above.

-- 
Giuseppe D'Angelo