[Qt-interest] Parsing input and performance of QRegExp

Tue Jul 26 16:32:01 CEST 2011

Hi! Thanks for your answer!

That helps! But why does it slow down so quickly and by so much?

I thought about it and concluded that a parser framework would be a bit
easier to maintain. The files can contain English and German text, and
building a regex for each language is a bit ugly, I think.

Do you have any experience with parser frameworks?

Thank you!

2011/7/26 Atlant Schmidt <aschmidt at dekaresearch.com>

>   Jens:
>
>   This may be my Luddite bias showing, but your problem seems simple
>   enough that a simple state machine and a very few string-compares will
>   probably solve it better than a bunch of regexps.
>
>   For example:
>
>   Start in state 0.
>
>   Is the game number always in the first line and always in the same
>   position (either starting at character [17] or at least after the “#”
>   character, and always terminated by the “:” character)? If so, then just
>   hand-craft a parser that, in state 0, only extracts the game number.
>
>   Move on to state 1 where you’ll be looking for the player names.
>
>   Are players always a “no-whitespace” string following a line that begins
>   with “Seat n:”? Hand craft a parser that grabs those. When a line doesn’t
>   start with “Seat...”, move to state 2.
>
>   In state 2, collect the costs. (I’m not sure I understand what you mean
>   there; do you mean “all the bets”?)
>
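
A minimal sketch of the three-state parser outlined above, in plain C++
without Qt so that it stands alone. The names HandInfo, State, startsWith and
parseHand are hypothetical, and the line formats are assumptions taken from
the description (game number between "#" and ":", player names as a
no-whitespace token after "Seat n:"), not from the actual hand-history file.
Lines between the first line and the first "Seat" line are skipped, which is
also an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result type; the field names are illustrative only.
struct HandInfo {
    std::string gameNumber;
    std::vector<std::string> players;
    std::vector<std::string> actionLines;  // state 2: lines kept raw for later
};

enum class State { GameNumber, Seats, Actions };

// True if `line` begins with `prefix`.
static bool startsWith(const std::string& line, const std::string& prefix) {
    return line.compare(0, prefix.size(), prefix) == 0;
}

HandInfo parseHand(const std::vector<std::string>& lines) {
    HandInfo info;
    State state = State::GameNumber;
    for (const std::string& line : lines) {
        switch (state) {
        case State::GameNumber: {
            // Assumed: the game number sits between '#' and the next ':'.
            const auto hash = line.find('#');
            if (hash == std::string::npos) break;
            const auto colon = line.find(':', hash);
            if (colon == std::string::npos) break;
            assert(colon > hash);
            info.gameNumber = line.substr(hash + 1, colon - hash - 1);
            state = State::Seats;
            break;
        }
        case State::Seats: {
            if (startsWith(line, "Seat ")) {
                // Assumed: "Seat n: name ..." with a name free of spaces.
                const auto colon = line.find(':');
                if (colon == std::string::npos) break;
                const auto begin = line.find_first_not_of(' ', colon + 1);
                if (begin == std::string::npos) break;
                const auto end = line.find(' ', begin);
                info.players.push_back(
                    end == std::string::npos ? line.substr(begin)
                                             : line.substr(begin, end - begin));
            } else if (!info.players.empty()) {
                // First non-"Seat" line after the seats: switch to state 2.
                state = State::Actions;
                info.actionLines.push_back(line);
            }
            break;
        }
        case State::Actions:
            info.actionLines.push_back(line);
            break;
        }
    }
    return info;
}
```

Calling parseHand() on the lines of one hand returns the game number, the
player names and the remaining lines. Player names that contain spaces, such
as "Player raises" in the original question, would need a different rule for
where the name ends.
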
>   Working across an array of strings already in memory, this all ought to
>   run in about a millisecond or so.
>
>   By the way, if you insist on using regexps, use “anchors” (e.g., “^”).
>   And if you’re sure about the white-spacing, don’t use variable
>   quantifiers.
>
>   So, for example, search for “^Seat \d*: (\w*)” will be a lot faster
>   than “Seat\s+\d*:\s*(\w*)” because 1) it allows the regexp engine to
>   discard any line that doesn’t **BEGIN** with “Seat” and 2) it is a
>   little easier to match for one constant space than for an infinitely
>   variable number of spaces.
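
To illustrate the anchored pattern, here is a small sketch that uses
std::regex from the C++ standard library instead of QRegExp, purely so it
compiles on its own. The function name extractSeatName is hypothetical. The
sketch uses "+" where the pattern above uses "*", so that at least one digit
and one name character are required; that is a deviation on my part, not
part of the suggestion. It shows the matching behaviour only and makes no
claim about the speed of either engine.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true and stores the player name if `line` begins with
// "Seat <digits>: <name>". The pattern is compiled once and reused.
bool extractSeatName(const std::string& line, std::string& name) {
    static const std::regex seatRe(R"(^Seat \d+: (\w+))");
    std::smatch m;
    if (!std::regex_search(line, m, seatRe)) {
        return false;  // line does not start with "Seat n: name"
    }
    assert(m.size() == 2);  // whole match plus one capture group
    name = m[1].str();
    return true;
}
```

Because of the "^" anchor, a line that merely contains "Seat 3: ..." further
to the right is rejected, and because the spaces are literal, a line with
two spaces after "Seat" is rejected as well.
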
>
>   There’s lots of literature about optimizing regexps.
>
>   But this problem seems simple enough that I wouldn’t go there at all.
>
>   Atlant
>  ------------------------------
>
> *From:* qt-interest-bounces+aschmidt=dekaresearch.com at qt.nokia.com
> [mailto:qt-interest-bounces+aschmidt=dekaresearch.com at qt.nokia.com]
> *On Behalf Of* Jens Saathoff
> *Sent:* Saturday, July 23, 2011 09:13
> *To:* qt-interest at trolltech.com
> *Subject:* [Qt-interest] Parsing input and performance of QRegExp
>
> Hi!
>
> I need to parse some input. The input is a poker hand-history file.
> You can view an example here: http://pastebin.com/rzhccyfK
>
> I need the following information:
>
> - All player names
> - The game number
> - All costs (and those of each player)
>
> Let's say I have to get all the information to put it in a database and to
> examine the data.
>
> My first attempt was to try regular expressions and boost::spirit.
>
> I found that regular expressions are much slower than parsing with
> boost::spirit. The problem with spirit is that it takes a long time to
> compile. Very long, but it's fast as hell!
>
> What would you suggest? Use another parser? QLALR?
>
> Can I parse the following with QLALR?
>
> Input1: Player raises $1 to $2 (Name: Player, amount $2)
> Input2: Player raises raises $1 to $2 (Name: Player raises, amount $2)
>
> And another thing: is it fast?
>
> What about other parsers? Any experience with Ragel, Bison or something
> else?
>
> Next thing: performance of QRegExp!
>
> I did a test with the following code:
>
> void MainWindow::on_btnTest_clicked()
> {
>     int i = TestRegex();
>     qDebug() << "Dauerte: " << i << "\n";
> }
> int MainWindow::TestRegex()
> {
>     list.clear();
>     qDebug() << "Start\n";
>     tgone.start();
>     regex.setCaseSensitivity(Qt::CaseInsensitive);
>     regex.setPatternSyntax(QRegExp::RegExp2);
>     regex.setPattern("^(\\d{2,2})\\.(\\d{2,2})\\.(\\d{4,4})$");
>
>     if(regex.isValid())
>     {
>         int i=0;
>         i++;
>         for(i=0; i<5000; i++)
>         {
>             if (regex.indexIn("12.03.2011") != -1)
>             {
>                  list.append(regex.cap(0));
>                  list.append(regex.cap(1));
>                  list.append(regex.cap(2));
>             }
>         }
>     }
>     qDebug() << "Elemente: " << list.count() << "\n";
>     return tgone.elapsed();
> }
>
> If I run the code for the first time, it takes 71 ms; on the second run
> 400 ms, on the third 501 ms. It grows!! But why?
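
One way to narrow down where the time goes is to time only the match loop,
with the pattern compiled beforehand and the result container local to the
run. The sketch below does that with std::regex and std::chrono rather than
QRegExp and the Qt timer, so it is a stand-in for the test above and does
not reproduce or explain the QRegExp numbers. The names BenchResult and
runDateRegexBench are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct BenchResult {
    std::size_t captures;  // number of strings stored
    long long elapsedMs;   // wall-clock time of the match loop only
};

BenchResult runDateRegexBench(int iterations) {
    assert(iterations >= 0);

    // Compile the pattern once, outside the timed region.
    const std::regex dateRe(R"(^(\d{2})\.(\d{2})\.(\d{4})$)");
    const std::string input = "12.03.2011";

    // Local container, so nothing carries over from one run to the next.
    std::vector<std::string> results;
    results.reserve(static_cast<std::size_t>(iterations) * 4);

    const auto start = std::chrono::steady_clock::now();
    std::smatch m;
    for (int i = 0; i < iterations; ++i) {
        if (std::regex_match(input, m, dateRe)) {
            // Store the whole match and the three captured groups.
            for (std::size_t g = 0; g < m.size(); ++g) {
                results.push_back(m[g].str());
            }
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return {results.size(), static_cast<long long>(ms.count())};
}
```

Calling runDateRegexBench(5000) several times in a row and comparing the
elapsedMs values shows whether repeated runs of the loop itself get slower.
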
>
> I know that other frameworks need to compile a regex; why is there no need
> to do so for QRegExp?
>
> Thank you very much! I'm really interested in your answers!!!