[Development] QRegularExpression -- first round of API review
Thiago Macieira
thiago.macieira at intel.com
Wed Jan 11 19:43:11 CET 2012
On Wednesday, 11 de January de 2012 17.53.17, Giuseppe D'Angelo wrote:
> 2012/1/11 Thiago Macieira <thiago.macieira at intel.com>:
>
> Hi, thanks for the reply. :)
>
> > First, you'll need to get a Nokian to import the PCRE sources. You cannot
> > submit them to Gerrit (not even to the commit you made) because that
> > violates the CLA. You're not the author. Please don't submit the PCRE
> > code again -- just pretend it's there.
>
> No problem at all. We can wait until the 8.30 release and import that.
> As a future-proofing question: a Nokian is needed to upgrade PCRE as
> well, right?
You cannot submit any changes you have not authored / don't have the copyright
on. The exactly language of the CLA you can find in the CLA :-)
> > The API looks now a lot more digestible. There are still a few methods
> > that I will need the documentation for, as I can't guess what they are
> > from their name ("subject" is probably an RE term that I don't know).
>
> "Subject" simply means "the string you're matching your regexp
> against", so there's no confusion when one talks about "strings"
> (there are the pattern string and the subject string).
Ok. But then see below...
> > The API around the
> > captured texts may need a few more rounds of discussion. The name "cap"
> > appeared in Qt 3 and if we're not able to keep source compatibility with
> > Qt 4 anyway, maybe it's time to fix it too.
>
> Any suggestion is welcome (a verbose "captured(int nth)")? I could also
> rename startPos / endPos to startPosition / endPosition.
I'd also add "captured" somewhere in the names of those too.
> > The iterating methods, which are the
> > cool thing about this API, seem to be lost. I don't see how to get the
> > contents of that match.
>
> What do you mean?
The are at the bottom, seemingly useless. They're "lost in the sea of other
methods". I could not find other methods that operated on the iterated
captures. I only found the random-access captures (cap & family) and the
global result ones.
> > Specific questions:
> >> * QRegularExpressionMatch::captureCount returns actually the
> >> highest
> >> index of a capturing group that matched something. Ideas?
> >> (lastCapturedIndex?)
> >
> > It seems that they are the same thing. captureCount looks fine if the
> > other
> > methods also have "capture" in the name. Does this return the number of
> > named captures too? E.g. imagine I have two named captures in my RE and
> > nothing else. If they match, will that return 2?
>
> Yes, but because you get *three* capturing groups: the implicit group
> 0 (the whole match) and the two named ones. Therefore, the highest
> index that captured is 2. If the second didn't capture, then
> captureCount() would return 1. But if the first didn't capture and the
> second did, then the returned value is still 2! The idea is that you
> can use this value know to extract all the captured groups.
Then call it lastCaptureIndex, I guess. The count would have to be 3.
> My doubt was a bit more general: a method called "captureCount"
> (although on the *match* class) could mislead people into thinking
> that that's the number of capturing groups in the regular expression
> itself, and not related about what's the index of the last capturing
> group that matched.
Keep the expression class with captureCount, as it's the number of groups in
the expression, not the number of captures succeeded.
> In that case cap(1) will return a null (not an empty!) QString. I can
> add some convenience for that (hasCaptured()?).
Convenience would be nice. Name TBD.
> > endPos should be one after the last character matched, so that in all
> > circumstances
> > end = start + length
> >
> > This holds for all containers, like QString, QByteArray, QVector, etc. If
> > this is difficult to visualise in the API, remove the "end" methods and
> > keep only start and length.
>
> Ok, then I'll do it.
>
> My point was that end = start + length implies that the actual ending
> is one codepoint less than what the end value is.
> F.i.: "abc" =~ /abc/
> - start = 0
> - length = 3
> - end = 3 (But the actual ending offset is 2. string.at(3) is even
> out of bounds)
That is correct. end is one past the last last valid. It's the first that isn't
included.
> You're right about multiline -- without it, \A and ^ are the same
> thing. In the general case you would use \A. You still need \z and not
> $ though, because $ matches even a newline before the ending (that is,
> "a string\n" matches "^a str\\w+$").
But does it match "a string\nfoo" ?
> Right. I'll devise something then. I'm not particulary happy with
> either of them, because I can't imagine a good idea of having an
> operator== / operator!= on a match result, and "hasNext" is non
> trivial -- it actually requires doing the match. As of now this last
> one can be rewritten as
>
> match = re.match();
> while (match.hasMatch()) {
> /* ... */
> match.advanceMatch();
> }
>
> which is indeed not consistent with Java iterators naming.
I agree. The result being also iteratable is weird. If it's not too complex,
having a separate iterator class might be better.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Intel Sweden AB - Registration Number: 556189-6027
Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.qt-project.org/pipermail/development/attachments/20120111/c87374aa/attachment.sig>
More information about the Development
mailing list