[Development] QRegularExpression -- first round of API review

Wed Jan 11 18:53:17 CET 2012

2012/1/11 Thiago Macieira <thiago.macieira at intel.com>:

Hi, thanks for the reply. :)

> First, you'll need to get a Nokian to import the PCRE sources. You cannot
> submit them to Gerrit (not even to the commit you made) because that violates
> the CLA. You're not the author. Please don't submit the PCRE code again --
> just pretend it's there.

No problem at all. We can wait until the 8.30 release and import that.
As a future-proofing question: a Nokian is needed to upgrade PCRE as
well, right?

By the way, I'm pushing updated code to Gitorious, including some method
documentation (I prefer pushing there because even small bugfixes don't
trigger emails, sanity checks, etc.):
https://qt.gitorious.org/~peppe/qt/peppes-qtbase/blobs/pcreregexp/src/corelib/tools/qregularexpression.h
https://qt.gitorious.org/~peppe/qt/peppes-qtbase/blobs/pcreregexp/src/corelib/tools/qregularexpression.cpp

> The API looks now a lot more digestible. There are still a few methods that I
> will need the documentation for, as I can't guess what they are from their
> name ("subject" is probably an RE term that I don't know).

"Subject" simply means "the string you're matching your regexp
against", so there's no confusion when one talks about "strings"
(there are the pattern string and the subject string).

> The API around the
> captured texts may need a few more rounds of discussion. The name "cap"
> appeared in Qt 3 and if we're not able to keep source compatibility with Qt 4
> anyway, maybe it's time to fix it too.

Any suggestion is welcome (a verbose "captured(int nth)"?). I could also
rename startPos / endPos to startPosition / endPosition.

> The iterating methods, which are the
> cool thing about this API, seem to be lost. I don't see how to get the
> contents of that match.

What do you mean?

> Specific questions:
>
>>     *   QRegularExpressionMatch::captureCount returns actually the highest
>>         index of a capturing group that matched something. Ideas?
>>         (lastCapturedIndex?)
>
> It seems that they are the same thing. captureCount looks fine if the other
> methods also have "capture" in the name. Does this return the number of named
> captures too? E.g. imagine I have two named captures in my RE and nothing
> else. If they match, will that return 2?

Yes, but because you get *three* capturing groups: the implicit group
0 (the whole match) and the two named ones. Therefore, the highest
index that captured is 2. If the second didn't capture, then
captureCount() would return 1. But if the first didn't capture and the
second did, then the returned value is still 2! The idea is that you
can use this value to extract all the captured groups.
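
To make the "highest index that captured" rule concrete, here is a minimal
sketch. It uses std::regex from the C++ standard library as a stand-in, not
the proposed QRegularExpression API, and it uses numbered groups because
std::regex has no named ones. The helper name highestMatchedGroup is
hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the highest index of a capturing group that
// took part in the match, or -1 if the pattern did not match at all.
// Group 0 is the implicit "whole match" group.
int highestMatchedGroup(const std::string& pattern, const std::string& subject) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return -1;
    // Walk down from the last group until one that actually matched is found.
    for (std::size_t i = m.size(); i-- > 0;) {
        if (m[i].matched)
            return static_cast<int>(i);
    }
    return -1;
}
```

With the pattern "x(foo)?(bar)?", the subject "xfoo" gives 1, while "xbar"
gives 2 even though group 1 did not capture anything.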

The long explanation is that, in Perl regexps, a named capturing
group also has an ordinary integer number, so you can get the two
captures with cap(1) and cap(2) as well as cap("first") and
cap("second").  This has "interesting" consequences with the branch
reset operator (?|...) and/or with duplicated names.

My doubt was a bit more general: a method called "captureCount"
(although on the *match* class) could mislead people into thinking
that it returns the number of capturing groups in the regular
expression itself, rather than the index of the last capturing group
that matched.

> If my RE has a capture that is optional and fails to match, how do I find out?
> Imagine:
>
>        rx = /(foo)?(bar)/
>        rx =~ "bar"
>
> In this case, the first capture failed to match anything. How do I know that in
> the API?

In that case cap(1) will return a null (not an empty!) QString. I can
add some convenience for that (hasCaptured()?).
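
The distinction between "did not capture" and "captured an empty string" can
be sketched as follows. This uses std::regex as a stand-in for the proposed
class; the helper name groupParticipated is hypothetical and only plays the
role that something like hasCaptured() might play.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: true if capturing group n took part in the match.
// A group that matched the empty string still counts as having captured.
bool groupParticipated(const std::string& pattern,
                       const std::string& subject,
                       std::size_t n) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return false;
    return n < m.size() && m[n].matched;
}
```

For /(foo)?(bar)/ against "bar", group 1 reports false and group 2 reports
true. For /(x*)(bar)/ against "bar", group 1 reports true, because it did
match, just with zero characters.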

> How is this even a problem? Under which circumstances is the triad start,
> length, end not holding?

The question was "which triad should be holding"?

> endPos should be one after the last character matched, so that in all
> circumstances
>        end = start + length
>
> This holds for all containers, like QString, QByteArray, QVector, etc. If this
> is difficult to visualise in the API, remove the "end" methods and keep only
> start and length.

Ok, then I'll do it.

My point was that end = start + length implies that the actual ending
is one codepoint less than what the end value is.
For instance: "abc" =~ /abc/
 - start = 0
 - length = 3
 - end = 3 (But the actual ending offset is 2. string.at(3) is even
out of bounds)
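
The half-open convention (end = start + length, with end pointing one past
the last matched character) can be sketched like this. std::regex stands in
for the proposed API, and the MatchSpan struct and matchSpan helper are
hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type: [start, end) is a half-open range in the subject.
struct MatchSpan {
    bool found;
    std::size_t start;
    std::size_t length;
    std::size_t end;  // one past the last matched character
};

MatchSpan matchSpan(const std::string& pattern, const std::string& subject) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return {false, 0, 0, 0};
    const auto start = static_cast<std::size_t>(m.position(0));
    const auto length = static_cast<std::size_t>(m.length(0));
    return {true, start, length, start + length};
}
```

With this convention the matched text is simply
subject.substr(span.start, span.end - span.start), with no off-by-one
adjustment, which is the same arithmetic used with container iterators.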

>> Does a string exactly match a pattern?
>>
>>   Version 1
>>      QString str("a string");
>>      bool matches = str.contains(QRegularExpression("\\Aa str\\w+\\z"));
>>      // matches == true
>
> A non-initiated like me might write "^a str\\w+$". I'd expect that to work
> and, by default, ^ is the beginning of the string and $ the end. Note I did
> not set MultilineOption.

You're right about multiline -- without it, \A and ^ are the same
thing. In the general case you would use \A. You still need \z and not
$ though, because $ also matches just before a newline at the very end
of the subject (that is, "a string\n" matches "^a str\\w+$").

> This one mixes STL-style methods (operator++) with Java-style ones. Either we
> do:
>
>        for (match = re.match(); match != re.end(); ++match)
>
> or we do:
>
>        match = re.match();
>        while (match.hasNext()) {
>                /* whatever */
>                match.next();
>        }

Right. I'll devise something then. I'm not particularly happy with
either of them, because I can't think of a sensible meaning for
operator== / operator!= on a match result, and "hasNext" is
non-trivial -- it actually requires doing the match. As of now this
last one can be rewritten as

    match = re.match();
    while (match.hasMatch())  {
         /* ... */
        match.advanceMatch();
     }

which is indeed not consistent with Java iterator naming.
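
For comparison, the C++ standard library takes the STL-style route: a
default-constructed std::sregex_iterator acts as the end sentinel, and the
next match is computed on ++. A minimal sketch follows; it shows std::regex,
not the proposed QRegularExpression API, and the helper name allMatches is
hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of pattern in
// subject, iterating STL-style against a default-constructed end iterator.
std::vector<std::string> allMatches(const std::string& pattern,
                                    const std::string& subject) {
    const std::regex re(pattern);  // must outlive the iterators below
    std::vector<std::string> out;
    const std::sregex_iterator end;  // default-constructed == "no more matches"
    for (std::sregex_iterator it(subject.begin(), subject.end(), re);
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Here the comparison against the end sentinel is an iterator comparison, not
a comparison between two match results, which sidesteps the question of what
operator== on a match would mean.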

Thanks for the feedback,
-- 
Giuseppe D'Angelo