[Qt-interest] QRegExp for document analysing

Mon Dec 6 13:59:03 CET 2010

Okay, thank you. So how can I know when to use a regex and when to use
a normal parser?

Omar.

On 06/12/10 09:47, Diego Iastrubni wrote:
> Parse it by reading lines and using a "normal" parser. I don't think 
> that regex is the best tool for this job.
>
> On Sun, Dec 5, 2010 at 7:38 PM, Omar AKHAM <crtx.omar at gmail.com> wrote:
>
>     Hi,
>
>     I'm a new Python/Qt4 (PyQt4) developer with some basic knowledge
>     of regular expressions. I need to process some text to extract
>     specific information from it, and I need some help :).
>
>     I'm experimenting with the CACM collection. I have a file called
>     "cacm.all" which contains descriptions of 3204 documents in the
>     following format:
>
>         .
>         .
>         .
>         .I 20
>         .T
>         Accelerating Convergence of Iterative Processes
>         .W
>         A technique is discussed which, when applied
>         to an iterative procedure for the solution of
>         an equation, accelerates the rate of convergence if
>         the iteration converges and induces convergence if
>         the iteration diverges.  An illustrative example is given.
>         .B
>         CACM June, 1958
>         .A
>         Wegstein, J. H.
>         .N
>         CA580602 JB March 22, 1978  9:09 PM
>         .X
>         20    5    20
>         20    5    20
>         20    5    20
>         .I 21
>         .T
>         Algebraic Formulation of Flow Diagrams
>         .B
>         CACM June, 1958
>         .A
>         Voorhees, E. A.
>         .N
>         CA580601 JB March 22, 1978  9:10 PM
>         .X
>         21    5    21
>         21    5    21
>         21    5    21
>         679    5    21
>         21    6    21
>         407    6    21
>         3184    6    21
>         .
>         .
>         .
>
>
>     Each document description starts with ".I DOC_NUM" and contains
>     descriptive sections such as Title ".T", Summary ".W" and so on.
>     My experiment consists of extracting each document's number, title
>     and summary into a list, ignoring the other sections, in order to
>     INDEX them. I tried Python's "re" module like this:
>
>         import re
>         cacmCollection = open('cacm.all', 'rU').read()
>         regex = r'(\.[I]\s+\d+)\n(?:(\.(?:T|W)))+?'
>         docs = re.findall(regex, cacmCollection)
>
>     I'm expecting tuples such as:
>     [('.I 20', '.T','Accelerating Convergence of Iterative
>     Processes','.W','A technique is discussed which, when applied\nto
>     an iterative procedure for the solution of\nan equation,
>     accelerates the rate of convergence if\nthe iteration converges
>     and induces convergence if\nthe iteration diverges.  An
>     illustrative example is given.')
>     ,('.I 21', '.T','Algebraic Formulation of Flow Diagrams'),......]
>
>     But that is not what I get.
>     Can anyone correct me, or show me another technique to do this
>     (without iterating in a loop with comparisons), using either the
>     Python "re" module or "QRegExp"?
>
>     Thanks
>     Omar
>
>
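
As an illustration of the line-by-line approach suggested in the quoted
reply, here is a minimal Python sketch. It assumes the layout shown in the
sample: each record starts with ".I <number>", and each section marker is
a dot followed by one uppercase letter on a line of its own. The name
parse_cacm is hypothetical, not part of any library.

```python
def parse_cacm(lines):
    """Yield (doc_id, title, summary) for each record in a CACM-style file.

    Assumption: section markers look like ".T", ".W", ".B" on their own line.
    """
    doc_id, section, fields = None, None, {}
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(".I "):
            # A new record begins: emit the previous one, if any.
            if doc_id is not None:
                yield doc_id, fields.get("T", ""), fields.get("W", "")
            doc_id, section, fields = int(line[3:]), None, {}
        elif len(line) == 2 and line[0] == "." and line[1].isupper():
            # Section marker: remember which section we are in.
            section = line[1]
        elif doc_id is not None and section in ("T", "W"):
            # Keep only title and summary text; other sections are skipped.
            prev = fields.get(section)
            fields[section] = line if prev is None else prev + "\n" + line
    if doc_id is not None:
        yield doc_id, fields.get("T", ""), fields.get("W", "")
```

With the file from the question, list(parse_cacm(open('cacm.all'))) would
give one tuple per document. A document without a ".W" section gets an
empty string as its summary.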

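The regex route asked about in the original question can be sketched with
Python's "re" module, under stronger assumptions: ".T", when present,
directly follows the ".I" line; ".W", when present, directly follows the
title; and no title or summary line itself starts with a dot and an
uppercase letter. The names RECORD and extract_docs are made up for this
sketch.

```python
import re

# Assumed layout (from the sample in the question): ".I <n>", then an
# optional ".T" section, then an optional ".W" section, then other sections.
RECORD = re.compile(
    r"^\.I[ \t]+(\d+)[ \t]*\n"                    # document number
    r"(?:\.T[ \t]*\n(.*?)\n(?=\.[A-Z]|\Z))?"      # optional title
    r"(?:\.W[ \t]*\n(.*?)\n(?=\.[A-Z]|\Z))?",     # optional summary
    re.MULTILINE | re.DOTALL,
)


def extract_docs(text):
    """Return a list of (number, title, summary) string tuples."""
    return RECORD.findall(text)
```

re.findall returns an empty string for an optional group that did not take
part in the match, so a document without a summary has '' in the third
position. The pattern depends on the order of the sections, which is one
reason a line-by-line parser tends to be easier to maintain for this kind
of format.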