[Qt-interest] QRegExp for document analysing
Diego Iastrubni
diegoiast at gmail.com
Mon Dec 6 09:47:04 CET 2010
Parse it by reading the file line by line and using a "normal" parser. I
don't think a regex is the best tool for this job.
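
Something along these lines, as a rough sketch rather than a tested solution for the whole file. It assumes that every section marker sits alone on its line as a dot plus one capital letter, and that each document starts with ".I <number>". The names parse_cacm and _record are made up for this example.

```python
def _record(doc_num, fields):
    # Title lines are joined with spaces, summary lines keep their line breaks.
    return doc_num, " ".join(fields.get("T", [])), "\n".join(fields.get("W", []))


def parse_cacm(lines):
    """Yield (doc_number, title, summary) for each document in CACM-style lines."""
    wanted = ("T", "W")  # sections to keep; all other sections are skipped
    doc_num, section, fields = None, None, {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(".I "):
            # A new document begins: emit the previous one, if any.
            if doc_num is not None:
                yield _record(doc_num, fields)
            doc_num, section, fields = int(line.split()[1]), None, {}
        elif len(line) == 2 and line[0] == "." and line[1].isupper():
            # Section marker such as ".T", ".W", ".B" (assumed format).
            section = line[1]
        elif doc_num is not None and section in wanted:
            fields.setdefault(section, []).append(line)
    if doc_num is not None:
        yield _record(doc_num, fields)


# Small in-memory sample; for the real file, pass the open file object:
#     with open("cacm.all") as f:
#         docs = list(parse_cacm(f))
sample = [
    ".I 21",
    ".T",
    "Algebraic Formulation of Flow Diagrams",
    ".B",
    "CACM June, 1958",
]
for doc in parse_cacm(sample):
    print(doc)
```

A document without a ".W" section simply gets an empty summary, so every tuple has the same shape and can go straight into an index.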
On Sun, Dec 5, 2010 at 7:38 PM, Omar AKHAM <crtx.omar at gmail.com> wrote:
> Hi,
>
> I'm a new Python/Qt4 (PyQt4) developer, and I have some basic knowledge
> of regular expressions. I need to process some text to extract specific
> information from it, and I need some help :).
>
> So, I'm experimenting with the CACM collection. I have a file called
> "cacm.all" which contains the descriptions of 3204 documents, as follows:
>
> .
> .
> .
> .I 20
> .T
> Accelerating Convergence of Iterative Processes
> .W
> A technique is discussed which, when applied
> to an iterative procedure for the solution of
> an equation, accelerates the rate of convergence if
> the iteration converges and induces convergence if
> the iteration diverges. An illustrative example is given.
> .B
> CACM June, 1958
> .A
> Wegstein, J. H.
> .N
> CA580602 JB March 22, 1978 9:09 PM
> .X
> 20 5 20
> 20 5 20
> 20 5 20
> .I 21
> .T
> Algebraic Formulation of Flow Diagrams
> .B
> CACM June, 1958
> .A
> Voorhees, E. A.
> .N
> CA580601 JB March 22, 1978 9:10 PM
> .X
> 21 5 21
> 21 5 21
> 21 5 21
> 679 5 21
> 21 6 21
> 407 6 21
> 3184 6 21
> .
> .
> .
>
>
> Each document description starts with ".I DOC_NUM" and contains some
> descriptive sections such as Title ".T", Summary ".W", and so on.
> My experiment consists of extracting each document's number, title, and
> summary into a list, ignoring the other sections, in order to INDEX them.
> I tried Python's "re" module like this:
>
> import re
> cacmCollection = open('cacm.all', 'rU').read()
> regex = r'(\.[I]\s+\d+)\n(?:(\.(?:T|W)))+?'
> docs = re.findall(regex, cacmCollection)
>
> I expect tuples such as:
> [('.I 20', '.T','Accelerating Convergence of Iterative Processes','.W','A
> technique is discussed which, when applied\nto an iterative procedure for
> the solution of\nan equation, accelerates the rate of convergence if\nthe
> iteration converges and induces convergence if\nthe iteration diverges. An
> illustrative example is given.')
> ,('.I 21', '.T','Algebraic Formulation of Flow Diagrams'),......]
>
> But I don't get what I expect.
> Can anyone correct me, or show me another technique for doing this (without
> a loop that iterates with comparisons), using either Python's "re" module
> or "QRegExp"?
>
> Thanks
> Omar
>
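
If you still want to try the "re" route from your question, here is a hedged sketch of one way it might look. It assumes that ".T" always follows the ".I" line directly, that ".W" (when present) comes right after the title, and that some other section marker follows. DOC_RE and extract_docs are hypothetical names, and I have not run this against the full cacm.all.

```python
import re

DOC_RE = re.compile(
    r"^\.I[ \t]+(\d+)\n"      # ".I 20" line, capturing the document number
    r"\.T\n(.*?)\n"           # title text, up to the next section marker
    r"(?:\.W\n(.*?)\n)?"      # optional summary section
    r"(?=\.[A-Z]\b)",         # must be followed by another marker line
    re.MULTILINE | re.DOTALL,
)


def extract_docs(text):
    """Return a list of (doc_number, title, summary) tuples found in text."""
    # findall yields '' for the summary group when a document has no ".W".
    return [(int(num), title, summary) for num, title, summary in DOC_RE.findall(text)]
```

The lazy groups combined with the lookahead make each field stop at the next marker line, which is the part the original pattern was missing. It is noticeably more fragile than the line-based loop if the section order varies.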