[Qt-interest] Heuristics for determining text codec of file

Tue Jan 11 11:21:23 CET 2011

Thank you very much! Looks promising to me.

--

On Tue, 2011-01-11 at 15:05 +0900, suzuki toshiya wrote:
> Hi,
> 
> Although I've never tried to build or use it myself, I have heard
> that Mozilla's character encoding detector can be built
> as a standalone module:
> 
> Very old description:
> http://www.mozilla.org/projects/intl/detectorsrc.html
> 
> source code:
> http://hg.mozilla.org/mozilla-central/file/3ac595ba8c43/extensions/universalchardet
> 
> If you think Mozilla's detection is sufficient for your needs,
> please give it a try.
> 
> Regards,
> mpsuzuki
> 
> Robert Hairgrove wrote:
> > I am importing data from text files in the application I am writing.
> > However, users are not necessarily aware of the codec/encoding of their
> > files, so I would like to try to guess it within the application (I know
> > this isn't 100% reliable) but let the user override it if they do know
> > what they are doing.
> > 
> > Looking at the documentation for QTextCodec and related functions, it
> > seems that most of the "convert..." functions expect the data to have a
> > byte order mark (BOM). But this is usually not the case on non-Windows
> > systems (don't know about Mac these days).
> > 
> > Is there a library freely available which can take, for example, the
> > first 4K bytes of text and scan it for extended or Unicode characters,
> > returning the probable codec used? I'm sure this would come in handy for
> > many people. :)
> > 
> > Thank you.
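
For a first pass before reaching for a full detector such as Mozilla's, the check Robert describes (look at the first few KB, handle a BOM if present, otherwise inspect the high bytes) can be sketched in plain C++ roughly as below. This is only a minimal sketch under stated assumptions: guessEncoding is a hypothetical helper, not part of Qt or of universalchardet; the returned strings are illustrative labels; and "ISO-8859-1" as the fallback is an assumption (a real application might prefer the locale codec). The UTF-8 test is structural only and does not reject every overlong or surrogate form.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess the encoding of a sample (e.g. the first 4K of a file).
std::string guessEncoding(const std::string& data)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data.data());
    const std::size_t n = data.size();
    const char* fallback = "ISO-8859-1"; // assumed default for non-UTF-8 8-bit text

    // 1. Byte order marks. UTF-32LE is tested before UTF-16LE because
    //    its BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return "UTF-32LE";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return "UTF-32BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";

    // 2. No BOM: check whether every non-ASCII byte belongs to a
    //    well-formed UTF-8 multi-byte sequence.
    bool sawHighByte = false;
    std::size_t multibyte = 0; // number of complete, valid sequences seen
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        if (c < 0x80) { ++i; continue; }   // plain ASCII byte
        sawHighByte = true;

        std::size_t extra = 0;             // continuation bytes expected
        if (c >= 0xC2 && c <= 0xDF)      extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return fallback;              // not a valid UTF-8 lead byte

        if (i + extra >= n) break;         // sequence cut off by the sample boundary
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return fallback; // bad continuation byte
        }
        ++multibyte;
        i += extra + 1;
    }

    if (multibyte > 0) return "UTF-8";
    return sawHighByte ? fallback : "US-ASCII";
}
```

A result like this is best treated as a default for the import dialog that the user can still override, since 8-bit encodings other than UTF-8 cannot be told apart by this check; that is the part a statistical detector is meant to cover.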