[Interest] Code Review For Correcting Encoding Errors
Thiago Macieira
thiago.macieira at intel.com
Thu Apr 2 00:00:38 CEST 2015
On Wednesday 01 April 2015 17:07:55 Kyle Neal wrote:
> Here is the implementation of my valid utf8 checker
>
> bool Parser::isUTF8( std::string string )
> {
>     QString utf8str = QString::fromUtf8( string.c_str() );
>
>     for ( int i = 0; i < utf8str.length(); i++ ) {
>         if ( utf8str.at( i ) == -3 ) {
>             return false;
>         }
>     }
>
>     return true;
> }
This is wrong. The source could legitimately contain the U+FFFD character
(REPLACEMENT CHARACTER, bytes EF BF BD in UTF-8), in which case the input is
valid UTF-8 but you'd return false.
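To make that concrete: U+FFFD has an ordinary three-byte UTF-8 encoding, so
it can appear in perfectly well-formed input. A minimal sketch using only the
C++ standard library; encodeUtf8 is a hypothetical helper written for this
illustration (it is not a Qt function), and it assumes its argument is a
valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8.
std::string encodeUtf8(char32_t cp)
{
    // Assumption: the caller passes a valid scalar value (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then 3 continuations
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

encodeUtf8(0xFFFD) yields the bytes EF BF BD. A file containing those bytes
decodes to U+FFFD without any error, and looking for U+FFFD in the decoded
text cannot tell that case apart from a real decoding failure.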
Instead, use QTextCodec with a stateful decoder and check if the number of
invalid characters is non-zero.
That is:
    QTextCodec::ConverterState state;
    QTextCodec *utf8Codec = QTextCodec::codecForMib(106);   // MIB 106 is UTF-8
    QString result = utf8Codec->toUnicode(string.c_str(), string.length(),
                                          &state);
    // isUTF8() should be true only when the decoder found no invalid input
    return state.invalidChars == 0;
I would also recommend that you:
- don't discard that QString result. Reuse it.
- don't use std::string to represent an encoding. A QTextCodec pointer is the
right way.
- use the code above to check any encoding, not just UTF-8
- stop using ifstream
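
The idea behind these recommendations (decode once, keep the result, and look
at an error counter rather than at the decoded text) can be sketched without
Qt. This is an illustration only, not how QTextCodec is implemented:
DecodeResult, decodeUtf8 and decodeUtf8Examples are hypothetical names, and
replacing each offending byte with one U+FFFD is a policy assumed here for
simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: the decoded text plus an error counter.
struct DecodeResult {
    std::u32string text;      // decoded code points, U+FFFD in place of bad bytes
    std::size_t invalid = 0;  // number of bytes that could not be decoded
};

// Hypothetical decoder: strict UTF-8, counting errors instead of hiding them.
DecodeResult decodeUtf8(const std::string &in)
{
    DecodeResult out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;        // 0 means "not a valid lead byte"
        char32_t cp = 0, min = 0;   // min rejects overlong encodings
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }

        // The sequence must fit in the input and use 10xxxxxx continuations.
        bool ok = len != 0 && n - i >= len;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.text.push_back(cp);
            i += len;
        } else {
            out.text.push_back(char32_t(0xFFFD));
            ++out.invalid;
            ++i;                    // resynchronise on the next byte
        }
    }
    return out;
}

void decodeUtf8Examples()
{
    // Valid input that happens to contain U+FFFD: no errors reported.
    const DecodeResult ok = decodeUtf8("\xEF\xBF\xBD");
    assert(ok.invalid == 0 && ok.text.size() == 1 && ok.text[0] == 0xFFFD);

    // A stray 0xFF byte is invalid: one error, and the text is still usable.
    const DecodeResult bad = decodeUtf8("a\xFF" "b");
    assert(bad.invalid == 1 && bad.text.size() == 3);
}
```

The caller gets the decoded text and the verdict from a single pass, so
nothing has to be decoded twice and a legitimate U+FFFD in the input is not
mistaken for an error.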
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center