[Interest] Code Review For Correcting Encoding Errors

Thiago Macieira thiago.macieira at intel.com
Thu Apr 2 00:00:38 CEST 2015


On Wednesday 01 April 2015 17:07:55 Kyle Neal wrote:
> Here is the implementation of my valid utf8 checker
> 
>  bool Parser::isUTF8( std::string string )
>  {
>     QString utf8str = QString::fromUtf8( string.c_str() );
> 
>     for ( int i = 0; i < utf8str.length(); i++ ) {
>         if ( utf8str.at( i ) == -3 ) {
>             return false;
>         }
>     }
> 
>     return true;
>  }

This is wrong. The input may legitimately contain U+FFFD (REPLACEMENT 
CHARACTER) encoded as UTF-8, that is, the bytes EF BF BD. That is valid 
UTF-8, but your function would return false for it.
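
To make the false negative concrete, here is a minimal sketch in plain 
C++ with no Qt. isWellFormedUtf8 is a hypothetical helper written only for 
this illustration; it is not a Qt API and I am not claiming it matches what 
QTextCodec does internally. It checks the byte-level rules of UTF-8 directly.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Qt): returns true if `bytes` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong encodings, surrogates and code points above U+10FFFF.
bool isWellFormedUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected after lead
        unsigned long cp = 0;      // code point being assembled
        unsigned long minCp = 0;   // smallest code point valid at this length

        if (lead < 0x80)                { extra = 0; cp = lead;        minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;         // stray continuation or invalid lead byte

        if (n - i - 1 < extra)
            return false;          // sequence is cut off at the end of input

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;      // not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}
```

isWellFormedUtf8("\xEF\xBF\xBD") returns true: those three bytes are U+FFFD 
itself, which is exactly the input that a scan for U+FFFD in the decoded 
QString would reject. A hand-rolled check like this is only meant to show 
the problem.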

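Instead, use QTextCodec with a stateful decoder (pass it a 
QTextCodec::ConverterState) and check whether the number of invalid 
characters it reports is non-zero. The input is valid only if that count 
is zero.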

That is:

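    // requires #include <QTextCodec>
    QTextCodec::ConverterState state;
    QTextCodec *utf8Codec = QTextCodec::codecForMib(106);   // MIB 106 is UTF-8
    QString result = utf8Codec->toUnicode(string.c_str(), int(string.length()),
                                          &state);
    return state.invalidChars == 0;   // true only if nothing was invalid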

I would also recommend that you:

 - don't discard that QString result. You have already decoded the data, so 
   reuse it.
 - don't use std::string to represent an encoding. A QTextCodec pointer is the 
   right way.
 - use the code above to check any encoding, not just UTF-8
 - stop using ifstream
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



