[Development] QRegularExpression -- first round of API review
Giuseppe D'Angelo
dangelog at gmail.com
Thu Dec 15 17:43:49 CET 2011
Hi everybody. Sorry for the length of thjis message, but doing API
reviews by mail is hard, and I needed to explain many decisions here and
there (and, of course, the API itself). :-(
Attached to this mail (and also here: <http://pastebin.com/KzsGFXJC> --
if you don't want to download and open a file) you can find a stripped
down version of the QRegularExpression API I'm working on. It's
deliberately stripped down because at this stage I'd like to have
feedbacks about the API and not about the code itself. We're still free
of changing all parts of it, so, even if you dislike the tiniest bit of
something, please go ahead and say it.
And it's deliberaltely without comments, so the reader's exercise now is
reading it and finding out if it's understandable at a first glance.
Once you're done, I'll start commenting the non-obvious points.
General
As discussed, two implicitly shared, copy on write, reentrant value
classes: the regexp itself (pattern + options) and the result of a match
against a subject string (fixes the T7 in
<http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5> ). The regexp
also stores the compiled form of itself as an hidden optimization --
most const methods actually do modify the shared instance. The backend
is PCRE so the feature set is quite big (fixing T1-T4). I have already
planned to expand the enums to support even more options.
QRegularExpression
PatternOptions
The PatternOptions flag contains the pattern options, i.e. modify the
characters that the pattern should match.
* CaseInsensitiveOption is the /i flag (case Insensitive Match). In
QRegExp it was set by a separate method (setCaseSensitivity),
holding a Qt::CaseSensitivity value. Here IMHO it just clutters the
API to have different getters and setters for the options.
* InvertedGreedinessOption is the /U flag (not present in Perl)
inverts the greediness of the quantifiers: they became not greedy by
default, and greedy if followed by ?. See also the T3 in the list.
* DotMatchesEverythingOption is the /s flag (Single line match): the
dot (.) metacharacter matches everything including newlines.
* MultilineMatchOption is the /m flag (Multi line match): the caret
(^) and the dollar ($) metacharacters are allowed to match at (and
just before) any newline inside the string (as well as the ordinary
begin and end of the string)
* ExtendedPatternSyntaxOption is the /x flag (eXtended pattern
syntax): whitespace data inside the regexp is ignored, and an
unescaped # outside a character class starts a comment that goes
till the next newline (inside the pattern!). It's probably not very
useful in C++, where one can use the default rules for string
literals, but it may be handy if one stores complicated regexps in
an external file.
* UseUnicodePropertiesOption enables the use Unicode properties for
character classes such as \w, \s, \d, etc. (and obviously their
counterparts \W, \S, etc.), instead of making them just match the
standard ASCII ranges
* AllowDuplicatedNamesOption allows duplicated named capturing
patterns (that is (?<NAME> ... ) patterns)
* DontCaptureOption disables the use of unnamed capturing parenthesis
-- they all behave as if they all were (?:...) parenthesis. Named
ones continue to work.
* MatchFirstlineOnlyOption restricts an unanchored pattern to start to
match only before the first newline (although the match can continue
over it).
* DollarMatchesOnlyAtEndOption changes the meaning of the dollar ($)
metacharacter to match only at the very end of the string, while
usually it matches at the end AND at a newline right before the end
CompileHints
The CompileHints are hints to the regexp engine itself. Basically they
control the optimization of the compiled regexp (the PCRE API is
pcre_study).
* StudyPatternCompileHint tells PCRE engine to study the pattern,
which basically means performing very simple optimization tasks
(f.i., the engine records the minimum size of a string to get a
successful match, and if the subject string is shorter the match
aborts immediately)
* JitCompileHint implies StudyPatternCompileHint, and enables the JIT
compilation of the pattern. Of course this has a cost, and there's a
tradeoff to consider between spending time compilating the regexp
instead of just matching against a string.
MatchOptions
The MatchOptions control the match itself.
* AnchoredMatch forces the match to start at the first matching point
inside the subject string, even if the regexp doesn't have carets or
other metacharacters that normally force such a match. In QRegExp
this is what CaretAtOffset did.
* NonEmptyMatch tells the engine that an empty match is not considered
a success (obviously, even if the regexp allowed one); therefore,
the engine continue searching for a non-empty match.
* NonEmptyAtStartMatch tells the engine that an empty match at the
beginning of the subject string is not acceptable as a match, but an
empty match occuring elsewhere is accepted.
Error reporting
isValid
isValid() compiles the pattern and returns true if it was successfully
compiled, false otherwise (f.i. the pattern contains a syntax error). A
default constructed QRegularExpression is NOT valid.
patternErrorOffset
patternErrorOffset() returns the offset (in UTF-16 codepoints) inside
the pattern string where a syntax error was found, or -1 if this is not
applicable.
errorString
errorString() returns a translated, textual description of the regexp
error. What should it return in case of no error? "No error"? An empty
string?
Quoting
escape
It implements perl's quotemeta -- _all_ characters outside [a-zA-Z0-9_]
get escaped by a backslash.
The Matching Functions
Here we go with the hardcore parts. :-)
All the matching functions return a QRegularExpressionMatch that can be
used to know if there was a full/partial match, to extract named
subpatterns, and to repeat the match on the subject string emulating the
/g option (see below for more info). All of them take a string to search
(the _subject_ string), plus other options.
match / exactMatch
They basically do the same thing: look for a match inside the subject.
match() succeeds if there's a substring of the subject string that
matches against the pattern; exactMatch() requires that the whole
subject string matches (in practice, it surrounds the pattern with \A
and \z). A difference I'd like to introduce w.r.t. QRegExp is that a
failed match with both of these functions does NOT return (in the
QRegularExpressionMatch object) the offsets of the partial match, if
any. If one is looking for partial matches then partialMatch should be
used.
lastMatch
What lastMatch does is to return the last match that you would get by
starting at the beginning of the string and matching forward (as in /g).
See also the discussion below. This is really tricky to do correctly --
PCRE doesn't support it directly. Also, obviously, one wants to avoid to
do the /g loop and take the last result. Perhaps a trick could be
rewriting the pattern as "(?s:.*?)pattern"). As of now, lastMatch
doesn't accept any option -- QRegExp accepted an offset from the end of
the string, I guess I could simply add it (plus a starting offset and
the usual matching options?).
partialMatch
partialMatch is used whenever we're looking for a partial match inside
the subject string. A partial match happens when, while we're
successfully matching a pattern against a subject, the end of the
subject string is reached. In an ordinary context usually this is a
failed match, but there are use cases in which we'd like instead to be
notified about the partial match found. Doing a partial match is quite
inefficient compared to an ordinary match because most of the
optimizations cannot be employed (f.i. the aforementioned check on the
string length >= the minimum required by the pattern, or backward
matching, etc.). Note that of course we can get a complete match while
attempting a partial match.
The use cases for this are basically:
* validators: if we have, say, a QRegExpValidator on a QLineEdit, then
while the user is typing, a partial match denotes an "Intermediate"
status of the validator -- the string may become fully Valid when
the user finishes to enter it.
* incremental matching (see T6): in order to scan a big text, we would
like to provide it in chunks to the regexp engine. The obvious
problem is "what if the text that matches the pattern is across 2+
chunks" -- that's where you could use partial matching. A partial
match becomes a way of saying "feed me with more input".
... plus others that I'm obviously forgetting right now.
TRIVIA: if I try to match the string "abca" against the pattern
"(abc)+", what should be the result of partialMatch?
1) No match
2) Partial match found
3) Full match found
4) I deserve to be punched in the face because I ask tricky questions
The answer is (tadah!) the fourth. In fact, both partial and full
matches are good answers to that question. It all depends if we prefer
to have a full match over a partial match (f.i.: the validator) or not
(f.i.: incremental matching). In fact, note that if we are doing
incremental matching an answer of "full match" would be wrong in case
after "abca" the text continues with "bc" -- the + quantifier is greedy,
thus it should have matched the whole "abcabc".
This whole behaviour is controlled by the preferCompleteMatch parameter.
Setting it to true enables the PCRE SOFT partial matching, which means
that it tries to find a complete match, and if it fails, it reports a
partial match (if any). Setting it to false enables the PCRE HARD
partial matching, which means that the matches bails out as soon as we
hit the end of the subject string. See also man 3 pcrepartial for more
details about the two matching modes.
(I'm not even sure it should be a boolean parameter -- perhaps I should
change it into an enum? If so, any suggestion for the names?)
QRegularExpressionMatch
This class contains the result of a match (plus methods to implement /g,
see below).
Information about the match
hasMatch() and hasPartialMatch() carry the obvious information.
Error reporting
isValid() returns false in case the QRegularExpressionMatch was
default-constructed, or there was a matching error (for instance,
backtracking limit reached). Unfortunately, PCRE provides no builtin
strings to translate matching errors to strings. Should we build ours
and provide an errorString() method as well?
Capture group information
This also includes information about the whole match, which is
"implicitly" inside the capture group 0.
capturedTextsCount
Returns the number of capturing groups that matched. But actually I need
another name for this, because I would like to return the highest index
of a capturing group that captured something (so that one can extract
all the substrings). Note that not all capturing groups are required to
match something. "(a)(b)?(c)" matches "ac" with 2 groups (#1 and #3,
group #2 doesn't match), and the highest index is therefore 3.
cap, capRef, capturedTexts, capturedTextsForName
cap and capRef return the QString (resp. QStringRef) that was captured
by the i-th group (int overload). For the QString overload, they return
the QString/QStringRef that was captured by the leftmost named group
with that name (in the order they appear in the pattern) that matched
something. This is coherent with Perl's %+ hash (cf. perldoc perlre) --
other RE engines (Python) don't support duplicated names inside the
pattern at all. We support them, provided that the right pattern option
is used.
capturedTexts simply returns all the captured texts.
capturedTextsForName returns all the captured texts by the capturing
groups with a given name.
Should I add QList<QStringRef> variants for these two last methods as
well?
Although it's discouraged to distinguish between null and empty
QStrings, here they DO make a difference -- a null QString means that a
certain group did not capture at all, while an empty QString that it
captured an empty string. I think the semantic is clear and does not
lead to ambiguities.
pos, matchedLength, endPos
They return the start position, length, and end position (in UTF-16
codepoints) of each capturing group inside the subject string -- same
semantic than cap() apply.
I'm missing the methods for returning ALL the starting
positions/lengths/end positions for all the groups for a given name (or
all groups). Any suggestion for such methods?
Iterators
Implementing /g (Global match) correctly (aka: as Perl does) is tricky
because of patterns that may potentially match an empty string (that is,
match 0 characters). For instance, "a*b*|c" inside "ccaabbd" should
return, under /g:
1) match at position 0, length 0
2) match at position 0, length 1 (c)
3) match at position 1, length 0
4) match at position 1, length 1 (c)
5) match at position 2, length 4 (aabb)
6) match at position 6, length 0
7) match at position 7, length 0
Note that the QRegExp "loop" solution in the docs fails miserably at
this (matching "a*" inside "b" causes an endless loop due to
matchingLength() = 0).
The *right* algorithm is:
* Find the first match normally
* If the last match didn't match an empty string, do a normal match
from the current offset inside the subject.
* If the last match matched an empty string, then attempt an ANCHORED,
NON EMPTY AT START match at the current offset inside the subject.
If it succeeds, then that's the next match. Otherwise, advance by 1
character and perform another ordinary match.
Although doable with our API, this is TOO tricky to be correctly
performed by an user. That's why I proposed to have methods to perform
this.
Note: these methods iterate on the SAME subject string -- an "high
level" incremental global matcher is not offered by this API.
nextMatch(), operator++
nextMatch() advances the match of "this" in the subject string following
the algorithm above. operator++ does the same. The intended API usage is
QRegularExpression rx(...);
QRegularExpressionMatch m = rx.match(str);
while (m.hasMatch()) {
...
m.nextMatch();
}
or
for (auto m = rx.match(str); m.hasMatch(); ++m) {
...
}
I am just a bit skeptical about operator++ because it leads to wanting
operator-- (and previousMatch) for backwards iteration.
previousMatch, operator--
These do backwards iteration in the /g loop, but I'm very skeptical
about implementing it -- it's too harsh to do it correctly (again,
unless one does the whole /g loop forward from the start every time, or
caches the result). There's a valid use case, though -- backward search
in an editor, for instance. Opinions? Maybe an editor could simply save
the offsets of a forward /g search and use those?
The End!
Well. If you survived, thanks for reading and sharing your feedbacks :-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qregularexpression.h
Type: text/x-chdr
Size: 4170 bytes
Desc: not available
URL: <http://lists.qt-project.org/pipermail/development/attachments/20111215/f4e5d072/attachment.h>
More information about the Development
mailing list