[Development] QRegularExpression -- first round of API review

Wed Jan 11 19:43:11 CET 2012

On Wednesday, 11 January 2012 17:53:17, Giuseppe D'Angelo wrote:
> 2012/1/11 Thiago Macieira <thiago.macieira at intel.com>:
> 
> Hi, thanks for the reply. :)
> 
> > First, you'll need to get a Nokian to import the PCRE sources. You cannot
> > submit them to Gerrit (not even to the commit you made) because that
> > violates the CLA. You're not the author. Please don't submit the PCRE
> > code again -- just pretend it's there.
> 
> No problem at all. We can wait until the 8.30 release and import that.
> As a future-proofing question: a Nokian is needed to upgrade PCRE as
> well, right?

You cannot submit any changes that you have not authored or don't hold the 
copyright on. The exact language is in the CLA itself :-)

> > The API looks now a lot more digestible. There are still a few methods
> > that I will need the documentation for, as I can't guess what they are
> > from their name ("subject" is probably an RE term that I don't know).
> 
> "Subject" simply means "the string you're matching your regexp
> against", so there's no confusion when one talks about "strings"
> (there are the pattern string and the subject string).

Ok. But then see below...

> > The API around the
> > captured texts may need a few more rounds of discussion. The name "cap"
> > appeared in Qt 3 and if we're not able to keep source compatibility with
> > Qt 4 anyway, maybe it's time to fix it too.
> 
> Any suggestion is welcome (a verbose "captured(int nth)")? I could also
> rename startPos / endPos to startPosition / endPosition.

I'd also add "captured" somewhere in the names of those too.

> > The iterating methods, which are the
> > cool thing about this API, seem to be lost. I don't see how to get the
> > contents of that match.
> 
> What do you mean?

They are at the bottom, seemingly useless. They're "lost in the sea of other 
methods". I could not find any other methods that operate on the iterated 
captures. I only found the random-access captures (cap & family) and the 
global result ones.

> > Specific questions:
> >>     *   QRegularExpressionMatch::captureCount returns actually the
> >>         highest index of a capturing group that matched something.
> >>         Ideas? (lastCapturedIndex?)
> > 
> > It seems that they are the same thing. captureCount looks fine if the
> > other methods also have "capture" in the name. Does this return the
> > number of named captures too? E.g. imagine I have two named captures in
> > my RE and nothing else. If they match, will that return 2?
> 
> Yes, but because you get *three* capturing groups: the implicit group
> 0 (the whole match) and the two named ones. Therefore, the highest
> index that captured is 2. If the second didn't capture, then
> captureCount() would return 1. But if the first didn't capture and the
> second did, then the returned value is still 2! The idea is that you
> can use this value to extract all the captured groups.

Then call it lastCaptureIndex, I guess. The count would have to be 3.
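
To make the difference between the two numbers concrete, here is a minimal sketch using std::regex from the C++ standard library as a stand-in. It is not the proposed QRegularExpression API: the CaptureInfo struct and the inspectCaptures() function are hypothetical names made up for this illustration, and numbered groups replace the named ones because std::regex has no named captures.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper type, for illustration only.
struct CaptureInfo {
    std::size_t count;  // groups in the expression, including implicit group 0
    int lastIndex;      // highest index of a group that captured, -1 if no match
};

CaptureInfo inspectCaptures(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    CaptureInfo info{static_cast<std::size_t>(re.mark_count()) + 1, -1};

    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return info;

    // After a successful match, the result always has one entry per group,
    // whether or not that group took part in the match.
    assert(m.size() == info.count);

    for (std::size_t i = 0; i < m.size(); ++i) {
        if (m[i].matched)
            info.lastIndex = static_cast<int>(i);
    }
    return info;
}
```

With the pattern "(x)?(y)?z", the count is 3 for every subject, while the last capturing index is 1 for "xz" and 2 for both "yz" and "xyz".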

> My doubt was a bit more general: a method called "captureCount"
> (although on the *match* class) could mislead people into thinking
> that that's the number of capturing groups in the regular expression
> itself, and not related to the index of the last capturing
> group that matched.

Keep captureCount on the expression class, as it's the number of groups in 
the expression, not the number of captures that succeeded.

> In that case cap(1) will return a null (not an empty!) QString. I can
> add some convenience for that (hasCaptured()?).

Convenience would be nice. Name TBD.

> > endPos should be one after the last character matched, so that in all
> > circumstances
> >        end = start + length
> > 
> > This holds for all containers, like QString, QByteArray, QVector, etc. If
> > this is difficult to visualise in the API, remove the "end" methods and
> > keep only start and length.
> 
> Ok, then I'll do it.
> 
> My point was that end = start + length implies that the actual ending
> is one codepoint less than what the end value is.
> F.i.: "abc" =~ /abc/
>  - start = 0
>  - length = 3
>  - end = 3 (But the actual ending offset is 2. string.at(3) is even
> out of bounds)

That is correct. end is one past the last valid position: it's the first one 
that isn't included.
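
As a small sketch of that half-open convention, again with std::regex standing in for the API under review (the Span struct and the matchSpan() and slice() functions are hypothetical names invented for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper type: a half-open range [start, end) into the subject.
struct Span {
    std::ptrdiff_t start;
    std::ptrdiff_t length;
    std::ptrdiff_t end;  // always start + length, one past the last matched character
};

Span matchSpan(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return Span{-1, 0, -1};  // sentinel for "no match"

    const std::ptrdiff_t start = m.position(0);
    const std::ptrdiff_t length = m.length(0);
    return Span{start, length, start + length};
}

// Extracts the matched text. Because end is exclusive, it can be used
// directly as an iterator bound, exactly like container end().
std::string slice(const std::string &subject, const Span &s)
{
    assert(s.start >= 0 && s.end <= static_cast<std::ptrdiff_t>(subject.size()));
    return std::string(subject.begin() + s.start, subject.begin() + s.end);
}
```

For "abc" matched against /abc/ this gives start = 0, length = 3 and end = 3. The end value is never dereferenced, only used as a bound, so it does not matter that it is out of range as an index.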

> You're right about multiline -- without it, \A and ^ are the same
> thing. In the general case you would use \A. You still need \z and not
> $ though, because $ matches even a newline before the ending (that is,
> "a string\n" matches "^a str\\w+$").

But does it match "a string\nfoo" ?

> Right. I'll devise something then. I'm not particularly happy with
> either of them, because I can't imagine a good idea of having an
> operator== / operator!= on a match result, and "hasNext" is non
> trivial -- it actually requires doing the match. As of now this last
> one can be rewritten as
> 
>     match = re.match();
>     while (match.hasMatch()) {
>         /* ... */
>         match.advanceMatch();
>     }
> 
> which is indeed not consistent with Java iterators naming.

I agree. The result also being iterable is weird. If it's not too complex, 
having a separate iterator class might be better.
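
For comparison, this is roughly how the standard library separates the two roles: std::smatch is a plain result, and std::sregex_iterator is the object that advances through the subject. This is only a sketch of the separation, not a proposal for the Qt naming, and allMatches() is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects the text of every non-overlapping match of pattern in subject.
std::vector<std::string> allMatches(const std::string &pattern, const std::string &subject)
{
    const std::regex re(pattern);
    std::vector<std::string> out;

    // A default-constructed iterator acts as the end-of-sequence marker.
    const std::sregex_iterator end;
    for (std::sregex_iterator it(subject.begin(), subject.end(), re); it != end; ++it) {
        // Each dereference yields an ordinary match result that knows
        // nothing about iteration.
        assert(!it->empty());
        out.push_back(it->str());
    }
    return out;
}
```

Here the matching work happens when the iterator is constructed and on each increment, so the result class needs neither an advance method nor comparison operators.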

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden