[Development] QRegularExpression -- first round of API review

Fri Dec 16 03:28:24 CET 2011

On 15 December 2011 22:53,  <joao.abecasis at nokia.com> wrote:
> Hi Giuseppe,

Hi João,
thanks for the comments.

> I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm actually referring to the API.
>
> I started looking at it and it seems too cluttered. Specially this early in the process. It's hard to review something that is trying to be everything or maybe it's just exposing too many things.
>
> I would like to challenge you to do the opposite: give us the minimum API that can offer the most important features. To get there start from short self-contained code examples showing the features in use (I'm interested in seeing those) -- API design is about how it gets used, not so much about the number of features.

Then we should discuss what those "important features" are. In my mail
I pointed out how the API is supposed to fix the T1-T7 issues, but I
can understand not all of them have the same priority.

(Other sub-objectives are 1) keeping the QRegExp "feature set" and 2)
keeping existing code working; but I guess that if QRegExp gets moved
into a stand-alone module, we can simply make other modules in Qt
depend on that one until we have a replacement, and thus not break
anything.)

I'd like to collect feedback on these priorities, especially from
the intended users of the API! If you have the chance, please ask
developers to give their opinions.

Continuing: you're totally right about the missing examples. I'll
write some down and then post them.

> For instance, in Perl, regular expressions are essentially a string, a couple of operators and some magic variables for the captures (and lots of magic everywhere...). (Now, I'm not saying we should do regular expressions in Qt as they are done in Perl)

That's basically the same here, but there are also options to control
the optimization level (you may not want to spend time optimizing,
esp. for quick, one-shot matches; or you may prefer to optimize if you
have to filter a model of 100k+ strings) and the match behaviour (some
options are actually implemented in Perl -- see what I wrote about the
/g algorithm -- but they are not available to the user).
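To make the optimization trade-off concrete, here is a minimal sketch.
It uses std::regex purely as a stand-in (the proposed Qt classes are
not final), and filterMatching and its `optimize` parameter are
hypothetical names for illustration only. The pattern is compiled once
and reused for every string, which is the case where paying for
optimization up front may be worth it:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not a Qt API): returns the items that contain a
// match for `pattern`. `optimize` stands in for the "optimization level"
// option discussed above; std::regex::optimize is only a hint to the engine.
std::vector<std::string> filterMatching(const std::vector<std::string> &items,
                                        const std::string &pattern,
                                        bool optimize)
{
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (optimize)
        flags |= std::regex::optimize;

    // Compile once, then reuse the compiled pattern for every item.
    const std::regex re(pattern, flags);

    std::vector<std::string> kept;
    for (const std::string &item : items) {
        if (std::regex_search(item, re))
            kept.push_back(item);
    }
    return kept;
}
```

The result is the same with or without the flag; only the balance
between compile time and match time is expected to change.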

At a high level:
- QRegularExpression is the pattern+options, as in Perl's "/pattern/options".
- QRegularExpressionMatch is the rough equivalent of Perl's capturing variables:
  - $& -> cap(0)
  - $1...$n -> cap(n)
  - $-[n] -> pos(n)
  - $+[n] -> endPos(n)
  - $+{str} -> capturedText(str)
  - $-{str} -> capturedTextsForName(str)
 etc.

Plus, the match object is used to implement /g, that is, to repeatedly
match on a string (as a convenience API).
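As a rough illustration of that split (pattern object vs. match
object), here is a sketch using std::regex as a stand-in; it is not
the proposed Qt API. The Capture struct and globalMatch function are
hypothetical names. Each match exposes the captured text plus start
and end offsets, and iterating over all matches plays the role of
Perl's /g:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical type (not a Qt API): one capture group of one match.
struct Capture {
    std::string text;   // captured text, like $n in Perl
    std::size_t start;  // offset of the first captured character
    std::size_t end;    // offset one past the last captured character
};

// Collects the captures of every non-overlapping match in `subject`,
// roughly what a /g loop does. Index 0 of each inner vector is the
// whole match, index n is the n-th capture group.
std::vector<std::vector<Capture>> globalMatch(const std::string &subject,
                                              const std::regex &re)
{
    std::vector<std::vector<Capture>> results;
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), last;
         it != last; ++it) {
        const std::smatch &m = *it;
        std::vector<Capture> caps;
        for (std::size_t n = 0; n < m.size(); ++n) {
            const std::size_t start = static_cast<std::size_t>(m.position(n));
            const std::size_t length = static_cast<std::size_t>(m.length(n));
            caps.push_back(Capture{m.str(n), start, start + length});
        }
        results.push_back(caps);
    }
    return results;
}
```

For the subject "a=1, bc=22" and the pattern (\w+)=(\d+), this yields
two matches, each with the whole match and two capture groups.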

> The API, as it stands seems too hard-wired to the implementation or feature set of the engine giving it little opportunity for evolving.

Where do you feel I could improve this? For instance, removing some
enums is easy, but I guess the problem isn't there.

> Another issue is that it is hard for me to see if (or where) the API itself is imposing performance penalties on the implementation.

The biggest hit as of now is obviously the conversions from/to UTF-8
all over the place, which have a memory and time impact. This will
eventually be solved when PCRE ships with native UTF-16 support (see
Stephen Kelly's mail about this). Apart from that, the implementation
is almost straightforward -- just call the function in PCRE that does
the work and grab the result.

The other hit I talked about is inside the implementation of iterating
backwards (and lastMatch). I cannot see any good general way to
implement it, apart from iterating forward from the beginning (and
possibly caching the results; a match result in terms of captured
strings is quite small -- we just keep the offsets inside the
subject). The fact is that matching backwards is simply something a
bit outside the regexp world.
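A minimal sketch of that forward-iterate-and-cache idea, again with
std::regex as a stand-in and hypothetical names (MatchSpan,
collectMatchSpans, lastMatchText); this is not the actual
implementation. Only offsets into the subject are cached, so the cache
stays small, and "the last match" is simply the last cached entry:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical type (not a Qt API): where a match sits inside the subject.
struct MatchSpan {
    std::size_t start;
    std::size_t length;
};

// Iterates forward from the beginning and caches only the offsets of
// each match, not the matched text.
std::vector<MatchSpan> collectMatchSpans(const std::string &subject,
                                         const std::regex &re)
{
    std::vector<MatchSpan> spans;
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), last;
         it != last; ++it) {
        spans.push_back(MatchSpan{static_cast<std::size_t>(it->position(0)),
                                  static_cast<std::size_t>(it->length(0))});
    }
    return spans;
}

// Returns the text of the last match, or an empty string if there is none.
// The cost is a full forward pass over the subject.
std::string lastMatchText(const std::string &subject, const std::regex &re)
{
    const std::vector<MatchSpan> spans = collectMatchSpans(subject, re);
    if (spans.empty())
        return std::string();
    return subject.substr(spans.back().start, spans.back().length);
}
```

Walking the cached vector in reverse then gives backwards iteration
without running the engine again.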

Cheers,
--
Giuseppe D'Angelo