[Qt-interest] QRegExp for document analysing

Omar AKHAM crtx.omar at gmail.com
Sun Dec 5 18:38:37 CET 2010


Hi,

I'm new to Python/Qt4 (PyQt4) development and have only basic knowledge 
of regular expressions. I need to process some text to extract specific 
information from it, and I need some help :).

I'm experimenting with the CACM collection. I have a file called 
"cacm.all" which contains the descriptions of 3204 documents, as follows:

    .
    .
    .
    .I 20
    .T
    Accelerating Convergence of Iterative Processes
    .W
    A technique is discussed which, when applied
    to an iterative procedure for the solution of
    an equation, accelerates the rate of convergence if
    the iteration converges and induces convergence if
    the iteration diverges.  An illustrative example is given.
    .B
    CACM June, 1958
    .A
    Wegstein, J. H.
    .N
    CA580602 JB March 22, 1978  9:09 PM
    .X
    20    5    20
    20    5    20
    20    5    20
    .I 21
    .T
    Algebraic Formulation of Flow Diagrams
    .B
    CACM June, 1958
    .A
    Voorhees, E. A.
    .N
    CA580601 JB March 22, 1978  9:10 PM
    .X
    21    5    21
    21    5    21
    21    5    21
    679    5    21
    21    6    21
    407    6    21
    3184    6    21
    .
    .
    .


Each document description starts with ".I DOC_NUM" and contains several 
descriptive sections such as Title ".T", Summary ".W" and so on.
My experiment consists of extracting each document's number, title and 
summary into a list, ignoring the other sections, in order to INDEX them. I 
tried Python's "re" module like this:

    import re

    cacmCollection = open('cacm.all', 'rU').read()
    regex = r'(\.[I]\s+\d+)\n(?:(\.(?:T|W)))+?'
    docs = re.findall(regex, cacmCollection)

I'm expecting tuples like this:
[('.I 20', '.T', 'Accelerating Convergence of Iterative 
Processes', '.W', 'A technique is discussed which, when applied\nto an 
iterative procedure for the solution of\nan equation, accelerates the 
rate of convergence if\nthe iteration converges and induces convergence 
if\nthe iteration diverges.  An illustrative example is given.'),
 ('.I 21', '.T', 'Algebraic Formulation of Flow Diagrams'), ...]

But I don't get what I'm expecting...
Can anyone correct me, or show me another technique to do this 
(without iterating in a loop with comparisons), using either Python's "re" 
module or "QRegExp"?
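
To make the goal concrete, here is a minimal sketch of one direction that might work with the "re" module. It is not a confirmed solution. It assumes that every record has a ".T" section right after its ".I" line, that ".W" (when present) comes right after the title, and that section markers start at column 0 in the real file (the indentation in the excerpt above is only mail formatting). The names DOC_RE, extract_docs and SAMPLE are made up for illustration.

```python
import re

# Assumed layout: ".I <number>", then a ".T" title section, then optionally
# a ".W" summary section. Markers are assumed to start at column 0.
DOC_RE = re.compile(
    r'^\.I[ \t]+(\d+)\n'     # group 1: document number
    r'\.T\n(.*?)\n'          # group 2: title (may span several lines)
    r'(?:\.W\n(.*?)\n)?'     # group 3: summary, only if a .W section follows
    r'(?=\.[A-Z]\b)',        # stop just before the next section marker
    re.MULTILINE | re.DOTALL,
)


def extract_docs(text):
    """Return a list of (number, title, summary) tuples; summary may be ''."""
    return DOC_RE.findall(text)


# Abridged sample built from the excerpt above.
SAMPLE = """\
.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
the iteration diverges.  An illustrative example is given.
.B
CACM June, 1958
.X
20    5    20
.I 21
.T
Algebraic Formulation of Flow Diagrams
.B
CACM June, 1958
.X
21    5    21
"""

docs = extract_docs(SAMPLE)
for number, title, summary in docs:
    print(number, title, repr(summary))
```

Unlike the tuples listed above, this sketch drops the ".I", ".T" and ".W" markers and returns an empty string when a record has no summary. On the real file the call would presumably be extract_docs(open('cacm.all').read()).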

Thanks
Omar