[Development] QRegularExpression -- first round of API review

Thu Dec 15 23:53:19 CET 2011

Hi Giuseppe,

I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm actually referring to the API.

I started looking at it and it seems too cluttered. Specially this early in the process. It's hard to review something that is trying to be everything or maybe it's just exposing too many things.

I would like to challenge you to do the opposite: give us the minimum API that can offer the most important features. To get there start from short self-contained code examples showing the features in use (I'm interested in seeing those) -- API design is about how it gets used, not so much about the number of features.

For instance, in Perl, regular expressions are essentially a string, a couple of operators and some magic variables for the captures (and lots of magic everywhere...). (Now, I'm not saying we should do regular expressions in Qt as they are done in Perl)

The API, as it stands seems too hard-wired to the implementation or feature set of the engine giving it little opportunity for evolving. Another issue is that it is hard for me to see if (or where) the API itself is imposing performance penalties on the implementation.

Finally, note that it is usually ok to add API as time goes by, but our BC promises mean we have to maintain most of what we add for a long time.

Cheers,

João

________________________________________
From: development-bounces+joao.abecasis=nokia.com at qt-project.org [development-bounces+joao.abecasis=nokia.com at qt-project.org] on behalf of ext Giuseppe D'Angelo [dangelog at gmail.com]
Sent: 15 December 2011 17:43
To: development at qt-project.org
Subject: [Development] QRegularExpression -- first round of API review

    Hi everybody. Sorry for the length of thjis message, but doing API
    reviews by mail is hard, and I needed to explain many decisions here and
    there (and, of course, the API itself). :-(

    Attached to this mail (and also here: <http://pastebin.com/KzsGFXJC> --
    if you don't want to download and open a file) you can find a stripped
    down version of the QRegularExpression API I'm working on. It's
    deliberately stripped down because at this stage I'd like to have
    feedbacks about the API and not about the code itself. We're still free
    of changing all parts of it, so, even if you dislike the tiniest bit of
    something, please go ahead and say it.

    And it's deliberaltely without comments, so the reader's exercise now is
    reading it and finding out if it's understandable at a first glance.
    Once you're done, I'll start commenting the non-obvious points.

General
    As discussed, two implicitly shared, copy on write, reentrant value
    classes: the regexp itself (pattern + options) and the result of a match
    against a subject string (fixes the T7 in
    <http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5> ). The regexp
    also stores the compiled form of itself as an hidden optimization --
    most const methods actually do modify the shared instance. The backend
    is PCRE so the feature set is quite big (fixing T1-T4). I have already
    planned to expand the enums to support even more options.

QRegularExpression
  PatternOptions
    The PatternOptions flag contains the pattern options, i.e. modify the
    characters that the pattern should match.

    *   CaseInsensitiveOption is the /i flag (case Insensitive Match). In
        QRegExp it was set by a separate method (setCaseSensitivity),
        holding a Qt::CaseSensitivity value. Here IMHO it just clutters the
        API to have different getters and setters for the options.

    *   InvertedGreedinessOption is the /U flag (not present in Perl)
        inverts the greediness of the quantifiers: they became not greedy by
        default, and greedy if followed by ?. See also the T3 in the list.

    *   DotMatchesEverythingOption is the /s flag (Single line match): the
        dot (.) metacharacter matches everything including newlines.

    *   MultilineMatchOption is the /m flag (Multi line match): the caret
        (^) and the dollar ($) metacharacters are allowed to match at (and
        just before) any newline inside the string (as well as the ordinary
        begin and end of the string)

    *   ExtendedPatternSyntaxOption is the /x flag (eXtended pattern
        syntax): whitespace data inside the regexp is ignored, and an
        unescaped # outside a character class starts a comment that goes
        till the next newline (inside the pattern!). It's probably not very
        useful in C++, where one can use the default rules for string
        literals, but it may be handy if one stores complicated regexps in
        an external file.

    *   UseUnicodePropertiesOption enables the use Unicode properties for
        character classes such as \w, \s, \d, etc. (and obviously their
        counterparts \W, \S, etc.), instead of making them just match the
        standard ASCII ranges

    *   AllowDuplicatedNamesOption allows duplicated named capturing
        patterns (that is (?<NAME> ... ) patterns)

    *   DontCaptureOption disables the use of unnamed capturing parenthesis
        -- they all behave as if they all were (?:...) parenthesis. Named
        ones continue to work.

    *   MatchFirstlineOnlyOption restricts an unanchored pattern to start to
        match only before the first newline (although the match can continue
        over it).

    *   DollarMatchesOnlyAtEndOption changes the meaning of the dollar ($)
        metacharacter to match only at the very end of the string, while
        usually it matches at the end AND at a newline right before the end

  CompileHints
    The CompileHints are hints to the regexp engine itself. Basically they
    control the optimization of the compiled regexp (the PCRE API is
    pcre_study).

    *   StudyPatternCompileHint tells PCRE engine to study the pattern,
        which basically means performing very simple optimization tasks
        (f.i., the engine records the minimum size of a string to get a
        successful match, and if the subject string is shorter the match
        aborts immediately)

    *   JitCompileHint implies StudyPatternCompileHint, and enables the JIT
        compilation of the pattern. Of course this has a cost, and there's a
        tradeoff to consider between spending time compilating the regexp
        instead of just matching against a string.

  MatchOptions
    The MatchOptions control the match itself.

    *   AnchoredMatch forces the match to start at the first matching point
        inside the subject string, even if the regexp doesn't have carets or
        other metacharacters that normally force such a match. In QRegExp
        this is what CaretAtOffset did.

    *   NonEmptyMatch tells the engine that an empty match is not considered
        a success (obviously, even if the regexp allowed one); therefore,
        the engine continue searching for a non-empty match.

    *   NonEmptyAtStartMatch tells the engine that an empty match at the
        beginning of the subject string is not acceptable as a match, but an
        empty match occuring elsewhere is accepted.

  Error reporting
   isValid
    isValid() compiles the pattern and returns true if it was successfully
    compiled, false otherwise (f.i. the pattern contains a syntax error). A
    default constructed QRegularExpression is NOT valid.

   patternErrorOffset
    patternErrorOffset() returns the offset (in UTF-16 codepoints) inside
    the pattern string where a syntax error was found, or -1 if this is not
    applicable.

   errorString
    errorString() returns a translated, textual description of the regexp
    error. What should it return in case of no error? "No error"? An empty
    string?

  Quoting
   escape
    It implements perl's quotemeta -- _all_ characters outside [a-zA-Z0-9_]
    get escaped by a backslash.

  The Matching Functions
    Here we go with the hardcore parts. :-)

    All the matching functions return a QRegularExpressionMatch that can be
    used to know if there was a full/partial match, to extract named
    subpatterns, and to repeat the match on the subject string emulating the
    /g option (see below for more info). All of them take a string to search
    (the _subject_ string), plus other options.

   match / exactMatch
    They basically do the same thing: look for a match inside the subject.
    match() succeeds if there's a substring of the subject string that
    matches against the pattern; exactMatch() requires that the whole
    subject string matches (in practice, it surrounds the pattern with \A
    and \z). A difference I'd like to introduce w.r.t. QRegExp is that a
    failed match with both of these functions does NOT return (in the
    QRegularExpressionMatch object) the offsets of the partial match, if
    any. If one is looking for partial matches then partialMatch should be
    used.

   lastMatch
    What lastMatch does is to return the last match that you would get by
    starting at the beginning of the string and matching forward (as in /g).
    See also the discussion below. This is really tricky to do correctly --
    PCRE doesn't support it directly. Also, obviously, one wants to avoid to
    do the /g loop and take the last result. Perhaps a trick could be
    rewriting the pattern as "(?s:.*?)pattern"). As of now, lastMatch
    doesn't accept any option -- QRegExp accepted an offset from the end of
    the string, I guess I could simply add it (plus a starting offset and
    the usual matching options?).

   partialMatch
    partialMatch is used whenever we're looking for a partial match inside
    the subject string. A partial match happens when, while we're
    successfully matching a pattern against a subject, the end of the
    subject string is reached. In an ordinary context usually this is a
    failed match, but there are use cases in which we'd like instead to be
    notified about the partial match found. Doing a partial match is quite
    inefficient compared to an ordinary match because most of the
    optimizations cannot be employed (f.i. the aforementioned check on the
    string length >= the minimum required by the pattern, or backward
    matching, etc.). Note that of course we can get a complete match while
    attempting a partial match.

    The use cases for this are basically:

    *   validators: if we have, say, a QRegExpValidator on a QLineEdit, then
        while the user is typing, a partial match denotes an "Intermediate"
        status of the validator -- the string may become fully Valid when
        the user finishes to enter it.

    *   incremental matching (see T6): in order to scan a big text, we would
        like to provide it in chunks to the regexp engine. The obvious
        problem is "what if the text that matches the pattern is across 2+
        chunks" -- that's where you could use partial matching. A partial
        match becomes a way of saying "feed me with more input".

    ... plus others that I'm obviously forgetting right now.

    TRIVIA: if I try to match the string "abca" against the pattern
    "(abc)+", what should be the result of partialMatch?

     1) No match
     2) Partial match found
     3) Full match found
     4) I deserve to be punched in the face because I ask tricky questions

    The answer is (tadah!) the fourth. In fact, both partial and full
    matches are good answers to that question. It all depends if we prefer
    to have a full match over a partial match (f.i.: the validator) or not
    (f.i.: incremental matching). In fact, note that if we are doing
    incremental matching an answer of "full match" would be wrong in case
    after "abca" the text continues with "bc" -- the + quantifier is greedy,
    thus it should have matched the whole "abcabc".

    This whole behaviour is controlled by the preferCompleteMatch parameter.
    Setting it to true enables the PCRE SOFT partial matching, which means
    that it tries to find a complete match, and if it fails, it reports a
    partial match (if any). Setting it to false enables the PCRE HARD
    partial matching, which means that the matches bails out as soon as we
    hit the end of the subject string. See also man 3 pcrepartial for more
    details about the two matching modes.

    (I'm not even sure it should be a boolean parameter -- perhaps I should
    change it into an enum? If so, any suggestion for the names?)

QRegularExpressionMatch
    This class contains the result of a match (plus methods to implement /g,
    see below).

  Information about the match
    hasMatch() and hasPartialMatch() carry the obvious information.

  Error reporting
    isValid() returns false in case the QRegularExpressionMatch was
    default-constructed, or there was a matching error (for instance,
    backtracking limit reached). Unfortunately, PCRE provides no builtin
    strings to translate matching errors to strings. Should we build ours
    and provide an errorString() method as well?

  Capture group information
    This also includes information about the whole match, which is
    "implicitly" inside the capture group 0.

   capturedTextsCount
    Returns the number of capturing groups that matched. But actually I need
    another name for this, because I would like to return the highest index
    of a capturing group that captured something (so that one can extract
    all the substrings). Note that not all capturing groups are required to
    match something. "(a)(b)?(c)" matches "ac" with 2 groups (#1 and #3,
    group #2 doesn't match), and the highest index is therefore 3.

   cap, capRef, capturedTexts, capturedTextsForName
    cap and capRef return the QString (resp. QStringRef) that was captured
    by the i-th group (int overload). For the QString overload, they return
    the QString/QStringRef that was captured by the leftmost named group
    with that name (in the order they appear in the pattern) that matched
    something. This is coherent with Perl's %+ hash (cf. perldoc perlre) --
    other RE engines (Python) don't support duplicated names inside the
    pattern at all. We support them, provided that the right pattern option
    is used.

    capturedTexts simply returns all the captured texts.

    capturedTextsForName returns all the captured texts by the capturing
    groups with a given name.

    Should I add QList<QStringRef> variants for these two last methods as
    well?

    Although it's discouraged to distinguish between null and empty
    QStrings, here they DO make a difference -- a null QString means that a
    certain group did not capture at all, while an empty QString that it
    captured an empty string. I think the semantic is clear and does not
    lead to ambiguities.

   pos, matchedLength, endPos
    They return the start position, length, and end position (in UTF-16
    codepoints) of each capturing group inside the subject string -- same
    semantic than cap() apply.

    I'm missing the methods for returning ALL the starting
    positions/lengths/end positions for all the groups for a given name (or
    all groups). Any suggestion for such methods?

  Iterators
    Implementing /g (Global match) correctly (aka: as Perl does) is tricky
    because of patterns that may potentially match an empty string (that is,
    match 0 characters). For instance, "a*b*|c" inside "ccaabbd" should
    return, under /g:

     1) match at position 0, length 0
     2) match at position 0, length 1 (c)
     3) match at position 1, length 0
     4) match at position 1, length 1 (c)
     5) match at position 2, length 4 (aabb)
     6) match at position 6, length 0
     7) match at position 7, length 0

    Note that the QRegExp "loop" solution in the docs fails miserably at
    this (matching "a*" inside "b" causes an endless loop due to
    matchingLength() = 0).

    The *right* algorithm is:

    *   Find the first match normally

    *   If the last match didn't match an empty string, do a normal match
        from the current offset inside the subject.

    *   If the last match matched an empty string, then attempt an ANCHORED,
        NON EMPTY AT START match at the current offset inside the subject.
        If it succeeds, then that's the next match. Otherwise, advance by 1
        character and perform another ordinary match.

    Although doable with our API, this is TOO tricky to be correctly
    performed by an user. That's why I proposed to have methods to perform
    this.

    Note: these methods iterate on the SAME subject string -- an "high
    level" incremental global matcher is not offered by this API.

   nextMatch(), operator++
    nextMatch() advances the match of "this" in the subject string following
    the algorithm above. operator++ does the same. The intended API usage is

     QRegularExpression rx(...);
     QRegularExpressionMatch m = rx.match(str);
     while (m.hasMatch()) {
        ...
        m.nextMatch();
     }

    or

     for (auto m = rx.match(str); m.hasMatch(); ++m) {
        ...
     }

    I am just a bit skeptical about operator++ because it leads to wanting
    operator-- (and previousMatch) for backwards iteration.

   previousMatch, operator--
    These do backwards iteration in the /g loop, but I'm very skeptical
    about implementing it -- it's too harsh to do it correctly (again,
    unless one does the whole /g loop forward from the start every time, or
    caches the result). There's a valid use case, though -- backward search
    in an editor, for instance. Opinions? Maybe an editor could simply save
    the offsets of a forward /g search and use those?

The End!
    Well. If you survived, thanks for reading and sharing your feedbacks :-)