[Qt-interest] Heuristics for determining text codec of file

Tue Jan 11 15:49:11 CET 2011

On 11.01.11 12:12:41, Robert Hairgrove wrote:
> On Tue, 2011-01-11 at 15:05 +0900, suzuki toshiya wrote:
> > Hi,
> > 
> > Although I've never tried to build or use it, I heard that
> > the character encoding detection of Mozilla can be built
> > as a standalone module:
> > 
> > Very old description:
> > http://www.mozilla.org/projects/intl/detectorsrc.html
> > 
> > source code:
> > http://hg.mozilla.org/mozilla-central/file/3ac595ba8c43/extensions/universalchardet
> > 
> > If you think Mozilla's detection is sufficient for you,
> > please try it.
> 
> On second look, I would have to modify the Mozilla source code in order to
> use it like this. Unfortunately, my app will be LGPLed, so it looks like I
> will have to roll my own. Besides, as you have pointed out, it is very old
> code... and it looks like Cyrillic isn't handled too well.

kdelibs' encoding detection got a face-lift sometime in the last two years.
The code is LGPL and is somewhat based on heuristics from a browser (not sure
which one exactly):
http://websvn.kde.org/trunk/KDE/kdelibs/kdecore/localization/

In particular, the kencodingdetector and kencodingprober classes in that
directory seem to do the job.
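
If you do end up rolling your own, the cheap checks can come before any
statistics. What follows is only a minimal sketch in plain C++, and it is not
the kdelibs or Mozilla algorithm: it looks for a byte-order mark and then
tests whether the bytes are well-formed UTF-8. The function names
guessFromBom, looksLikeUtf8 and guessEncoding are made up for illustration and
do not exist in Qt or kdelibs. Anything that fails both checks (for example
legacy Cyrillic 8-bit encodings) is reported as unknown here, which is where a
real detector would start its statistical analysis.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return a codec name if the data starts with a
// byte-order mark, or an empty string if no known BOM is found.
std::string guessFromBom(const std::string &data)
{
    auto startsWith = [&data](const std::string &prefix) {
        return data.compare(0, prefix.size(), prefix) == 0;
    };
    if (startsWith("\xEF\xBB\xBF"))
        return "UTF-8";
    // The UTF-32LE BOM begins with the UTF-16LE BOM, so test it first.
    if (startsWith(std::string("\xFF\xFE\x00\x00", 4)))
        return "UTF-32LE";
    if (startsWith(std::string("\x00\x00\xFE\xFF", 4)))
        return "UTF-32BE";
    if (startsWith("\xFF\xFE"))
        return "UTF-16LE";
    if (startsWith("\xFE\xFF"))
        return "UTF-16BE";
    return std::string();
}

// Hypothetical helper: check that every lead byte is followed by the right
// number of continuation bytes (10xxxxxx). Simplified: it does not reject
// every overlong form or surrogate code point.
bool looksLikeUtf8(const std::string &data)
{
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;
        if (c < 0x80)
            extra = 0;                        // ASCII
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                        // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                        // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                        // 4-byte sequence
        else
            return false;                     // stray continuation or invalid lead
        if (n - i - 1 < extra)
            return false;                     // sequence truncated at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(data[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;                 // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical entry point: BOM first, then UTF-8 validity. An empty result
// means "unknown", i.e. some other encoding that needs further analysis.
std::string guessEncoding(const std::string &data)
{
    const std::string fromBom = guessFromBom(data);
    if (!fromBom.empty())
        return fromBom;
    if (looksLikeUtf8(data))
        return "UTF-8";                       // pure ASCII also lands here
    return std::string();
}
```

Feeding the raw file contents to guessEncoding() gives a codec name that could
then be handed to something like QTextCodec::codecForName(); an empty result
means the guess has to come from elsewhere.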

Andreas
