[Qt-interest] QRegExp for document analysing
Omar AKHAM
crtx.omar at gmail.com
Sun Dec 5 18:38:37 CET 2010
Hi,
I'm a new Python/Qt4 (PyQt4) developer, and I have some basic knowledge
of regular expressions. I need to process some text to extract specific
information from it, and I need some help :).
I'm experimenting with the CACM collection. I have a file called
"cacm.all" which contains the descriptions of 3204 documents, in the
following format:
.
.
.
.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
an equation, accelerates the rate of convergence if
the iteration converges and induces convergence if
the iteration diverges. An illustrative example is given.
.B
CACM June, 1958
.A
Wegstein, J. H.
.N
CA580602 JB March 22, 1978 9:09 PM
.X
20 5 20
20 5 20
20 5 20
.I 21
.T
Algebraic Formulation of Flow Diagrams
.B
CACM June, 1958
.A
Voorhees, E. A.
.N
CA580601 JB March 22, 1978 9:10 PM
.X
21 5 21
21 5 21
21 5 21
679 5 21
21 6 21
407 6 21
3184 6 21
.
.
.
Each document description starts with ".I DOC_NUM" and contains several
descriptive sections, such as the title ".T", the summary ".W", and so on.
My experiment consists of extracting each document's number, title, and
summary into a list (ignoring the other sections) in order to INDEX them.
I tried the Python "re" module like this:
import re
cacmCollection = open('cacm.all', 'rU').read()
regex = r'(\.[I]\s+\d+)\n(?:(\.(?:T|W)))+?'
docs = re.findall(regex, cacmCollection)
I'm expecting tuples like this:
[('.I 20', '.T', 'Accelerating Convergence of Iterative Processes', '.W',
  'A technique is discussed which, when applied\nto an iterative procedure for the solution of\nan equation, accelerates the rate of convergence if\nthe iteration converges and induces convergence if\nthe iteration diverges. An illustrative example is given.'),
 ('.I 21', '.T', 'Algebraic Formulation of Flow Diagrams'), ......]
But I don't get what I'm expecting...
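For reference, the pattern above has only two capturing groups (the ".I" line and one section marker) and nothing that captures the text following a marker, so re.findall should return just those two pieces per document. A minimal sketch of this, assuming a shortened excerpt of the records shown earlier (the `sample` string is an abridged copy written out for illustration, not the real file):

```python
import re

# Abridged, illustrative copy of the two records shown above.
sample = (
    ".I 20\n.T\nAccelerating Convergence of Iterative Processes\n.W\n"
    "A technique is discussed which, when applied\n.B\nCACM June, 1958\n"
    ".I 21\n.T\nAlgebraic Formulation of Flow Diagrams\n.B\nCACM June, 1958\n"
)

# Same pattern as in the attempt above.
regex = r'(\.[I]\s+\d+)\n(?:(\.(?:T|W)))+?'

# findall returns one tuple per match, with one item per capturing group.
result = re.findall(regex, sample)
print(result)  # [('.I 20', '.T'), ('.I 21', '.T')]
```

The lazy "+?" stops after the first marker, and the title and summary text are never inside a capturing group, which is why they are missing from the result.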
Can anyone correct my regex, or show me another technique to do this
(without an explicit loop with comparisons), using either the Python "re"
module or "QRegExp"?
Thanks
Omar
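One possible direction with the "re" module, offered only as a sketch under assumptions read off the excerpt above: that ".T" always directly follows the ".I" line, that ".W" is optional and directly follows the title when present, and that every section marker sits at the start of its own line. The names SAMPLE, BODY and DOC_RE are made up for this illustration, and SAMPLE is an abridged copy of the two records shown in the question.

```python
import re

# Abridged, illustrative copy of the two records from the question.
SAMPLE = """\
.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
an equation, accelerates the rate of convergence if
the iteration converges and induces convergence if
the iteration diverges. An illustrative example is given.
.B
CACM June, 1958
.I 21
.T
Algebraic Formulation of Flow Diagrams
.B
CACM June, 1958
"""

# A section body: zero or more whole lines that do not start with a
# marker such as ".T", ".W" or ".I 21" (dot, capital letter, whitespace).
BODY = r'((?:(?!\.[A-Z]\s).*\n)*)'

DOC_RE = re.compile(
    r'^(\.I[ \t]+\d+)\n'           # group 1: the ".I <number>" line
    r'(\.T)\n' + BODY +            # groups 2-3: title marker and title text
    r'(?:(\.W)\n' + BODY + r')?',  # groups 4-5: optional summary marker and text
    re.MULTILINE,
)

# One tuple per document; groups that did not take part in the match
# (a missing ".W" section) are None and are dropped.
docs = [
    tuple(g.rstrip('\n') for g in m.groups() if g is not None)
    for m in DOC_RE.finditer(SAMPLE)
]

print(docs[1])  # ('.I 21', '.T', 'Algebraic Formulation of Flow Diagrams')
```

On the real file, SAMPLE would be replaced by the contents of "cacm.all". Whether the assumptions above hold for all 3204 records has not been checked here.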