[Qt-interest] Heuristics for determining text codec of file

Tue Jan 11 07:05:25 CET 2011

Hi,

Although I've never tried to build or use it myself, I have heard
that Mozilla's character encoding detector can be built as a
standalone module:

Very old description:
http://www.mozilla.org/projects/intl/detectorsrc.html

Source code:
http://hg.mozilla.org/mozilla-central/file/3ac595ba8c43/extensions/universalchardet

If you think Mozilla's detection is sufficient for your needs,
please give it a try.

Regards,
mpsuzuki

Robert Hairgrove wrote:
> I am importing data from text files in the application I am writing.
> However, users are not necessarily aware of the codec/encoding of their
> files, so I would like to try to guess it within the application (I know
> this isn't 100% reliable) but let the user override it if they do know
> what they are doing.
> 
> Looking at the documentation for QTextCodec and related functions, it
> seems that most of the "convert..." functions expect the data to have a
> byte order mark (BOM). But this is usually not the case on non-Windows
> systems (don't know about Mac these days).
> 
> Is there a library freely available which can take, for example, the
> first 4K bytes of text and scan it for extended or Unicode characters,
> returning the probable codec used? I'm sure this would come in handy for
> many people. :)
> 
> Thank you.
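
Before handing a sample to a statistical detector such as Mozilla's,
two cheap checks already settle many files: look for a BOM, and test
whether the bytes are well-formed UTF-8 (text in a legacy 8-bit codec
seldom passes that test by accident). The following is only a minimal
sketch in plain C++ with no Qt dependency. The names detectBom,
isValidUtf8 and guessEncoding are made up for illustration and are not
part of Qt or of universalchardet. Anything it reports as "unknown"
still needs a real detector or the user's own choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not Qt or universalchardet API.

// Returns the encoding announced by a leading byte order mark, or "" if none.
std::string detectBom(const std::string& data) {
    auto startsWith = [&data](const char* sig, std::size_t n) {
        return data.size() >= n && data.compare(0, n, sig, n) == 0;
    };
    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xEF\xBB\xBF", 3)) return "UTF-8";
    if (startsWith("\xFF\xFE", 2)) return "UTF-16LE";
    if (startsWith("\xFE\xFF", 2)) return "UTF-16BE";
    return "";
}

// Checks that data is well-formed UTF-8 (lead-byte and continuation-byte
// ranges as defined by the Unicode Standard). If truncated is true, a
// multi-byte sequence cut off at the very end of the sample is tolerated,
// because a fixed-size sample (e.g. 4K) can end in the middle of a character.
bool isValidUtf8(const std::string& data, bool truncated) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (c < 0x80) { ++i; continue; }         // plain ASCII byte
        std::size_t extra = 0;                   // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range of 2nd byte
        if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            if (c == 0xE0) lo = 0xA0;            // rejects overlong forms
            if (c == 0xED) hi = 0x9F;            // rejects UTF-16 surrogates
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            if (c == 0xF0) lo = 0x90;            // rejects overlong forms
            if (c == 0xF4) hi = 0x8F;            // rejects values > U+10FFFF
        } else {
            return false;                        // stray continuation or bad lead
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            if (i + k >= n) return truncated;    // sequence cut off at the end
            const unsigned char cc = static_cast<unsigned char>(data[i + k]);
            if (cc < lo || cc > hi) return false;
            lo = 0x80; hi = 0xBF;                // later bytes: normal range
        }
        i += extra + 1;
    }
    return true;
}

// First-pass guess: "UTF-8", "UTF-16LE", ..., "ASCII", or "unknown".
std::string guessEncoding(const std::string& sample, bool truncated = false) {
    const std::string bom = detectBom(sample);
    if (!bom.empty()) return bom;
    // NUL bytes suggest BOM-less UTF-16/32 or binary data; do not guess.
    if (sample.find('\0') != std::string::npos) return "unknown";
    bool ascii = true;
    for (unsigned char c : sample) {
        if (c >= 0x80) { ascii = false; break; }
    }
    if (ascii) return "ASCII";
    if (isValidUtf8(sample, truncated)) return "UTF-8";
    return "unknown";  // hand over to a statistical detector or ask the user
}
```

In a Qt application the sample could be the QByteArray returned by
QFile::read(4096), copied with std::string(ba.constData(), ba.size())
and passed with truncated set to true when the file is longer than
the sample. The result is only a suggestion, so the user should still
be able to override it.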