[Qt-interest] FW: Html parsing
Oliver.Knoll at comit.ch
Mon Dec 1 16:04:59 CET 2008
Frédéric LECONTE wrote on Monday, December 01, 2008 3:32 PM:
>> ..
>> One method is to run the html document through something that can
>> convert it to a valid XML document such as htmltidy.
> I don't want to validate my web page, I don't need to re-use it
That's not the point: as you mentioned yourself, the page in question (..) is not valid XHTML, so what Sean actually meant was that you could use htmltidy to convert the page to valid XHTML. Once you have done that, you could use the QXmlStreamReader you initially suggested, parse the "tidy" XHTML page and extract the DOM elements you need.
>>
>>> Can I use QWebPage, not display it on screen and find a way to get
>>> certain fields ?
>>
>> You can access elements via javascript which has been discussed many
>> times on the list.
>>
>
> It sounds like a complicated way to do a simple thing...
> No simpler code?
Well, the following is a cheap hack (but since "cheap" most often is a prerequisite to "best method" in economic thinking, it's not that bad, hehe ;). But as long as Google doesn't notice that you "steal" from their page... okay, here we go :)
I just checked the result page of http://translate.google.fr/ and it seems that the translated text is put into a <div>:
<div id=result_box dir="ltr">Ceci est un test</div>
The nice thing is that this <div> already has a *unique* ID (id=result_box - note that valid (X)HTML would require quotes here, as in id="result_box"), which makes it easy to parse. So either you use a plain and simple QString::indexOf (http://doc.trolltech.com/4.4/qstring.html#indexOf) to find the substring "<div id=result_box dir=\"ltr\">" and take everything after it until you reach a </div>. Note: you don't even need to worry about nasty users - such as me - trying to have something like "This is a nasty </div>" translated. Google already takes care of this and translates the </div> to </ div> :)
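To make the substring approach concrete, here is a minimal sketch. It uses std::string::find as a stand-in for QString::indexOf so that it compiles without Qt; the helper name extractResultBox and its signature are made up for illustration, and the marker string is the one observed on the result page above.

```cpp
#include <string>

// Hypothetical helper (not part of Qt or of the original post).
// Copies the text between the opening result_box tag and the next "</div>"
// into `out`. Returns false if either marker is missing.
bool extractResultBox(const std::string& html, std::string& out) {
    const std::string openTag = "<div id=result_box dir=\"ltr\">";
    const std::string closeTag = "</div>";

    // Locate the opening marker.
    const std::string::size_type start = html.find(openTag);
    if (start == std::string::npos) {
        return false;
    }

    // The content begins right after the opening marker.
    const std::string::size_type contentStart = start + openTag.size();

    // Search for the first closing tag after the content start.
    const std::string::size_type end = html.find(closeTag, contentStart);
    if (end == std::string::npos) {
        return false;
    }

    out = html.substr(contentStart, end - contentStart);
    return true;
}
```

With Qt the same steps would presumably map onto QString::indexOf (which also takes a start position) and QString::mid. Called on a page containing the <div> shown above, the helper should leave "Ceci est un test" in out.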
Or - a bit more elegant and even easier - you use a QRegExp ("regular expression"), maybe something like:
    QString htmlText = ...;
    QRegExp regExp("<div id=result_box dir=\"ltr\">(.*)</div>");
    regExp.setMinimal(true);  // non-greedy: stop at the first </div>
    int pos = regExp.indexIn(htmlText);
    if (pos > -1) {
        QString translatedText = regExp.cap(1);
        ...
    }
Note the QRegExp example in {QTDIR}/examples/tools/regexp/ - it allows you to test your expressions (mine above might not be bullet-proof!).
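One way such an expression can fail to be bullet-proof: a greedy (.*) runs on to the *last* </div> in the page, not the first one after the result box. The sketch below illustrates the non-greedy variant with std::regex, used here only as a stand-in for QRegExp so that it builds without Qt; the helper name captureResultBox is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt or of the original post).
// Stores the first capture group in `out` when the result <div> is found.
bool captureResultBox(const std::string& html, std::string& out) {
    // "(.*?)" is the lazy form of "(.*)": it stops at the first "</div>".
    // Note that '.' does not match line breaks in this grammar, so this
    // assumes the translated text sits on a single line.
    static const std::regex re("<div id=result_box dir=\"ltr\">(.*?)</div>");

    std::smatch match;
    if (!std::regex_search(html, match, re)) {
        return false;
    }
    out = match[1].str();
    return true;
}
```

On input where another <div> follows the result box, the lazy quantifier should capture only the translated text, whereas a greedy one would also swallow the markup in between.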
Again, this works until Google realises your little service exploitation and changes their HTML code ;)
Cheers, Oliver