[Interest] Code Review For Correcting Encoding Errors

Wed Apr 1 23:07:55 CEST 2015

Hey guys, first of all, I've been an active member in #qt and I appreciate
all the help I've received from the guys in there. I started Qt a couple of
months ago for a new project and I could not have got this far without
everyone's help. I am very grateful.

I'm hoping for a little bit of a code review to ensure I am doing UTF-8
right. I am developing a CSV fixer that lets the user select the encoding
of the file so that encoding errors can be handled correctly. The user can
also skip this step, in which case I just remove the fields that have
encoding errors.

So what I need is a bit of a review just to be absolutely sure that I am
doing this right. I know encoding is almost an impossible problem but
hopefully I can be told otherwise.

My logic is as follows. At the beginning, I read the source CSV file and
count the rows that contain invalid UTF-8, using my bool isUTF8() shown
below. I collect either 200 rows with encoding errors or all rows with
encoding errors in the file, whichever comes first. These rows are then
shown to the user on a screen that lets them pick the encoding of the
file, previewing the data as they choose. Once this is done, the file is
read again from the beginning using the selected encoding. That is, I read
each row from the CSV file, convert the string from the encoding the user
selected, and attempt to write it to the output file. Any rows that still
have encoding errors (non-UTF-8 data) under the selected encoding are
written to an encoding errors file. At the end of parsing, the user can
choose the encoding of those items, repeating until all items are written
to the output file.
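
To make the sampling step concrete, here is a minimal sketch in plain C++
(no Qt). The name collectSuspectRows and the isValid callback are made up
for illustration and are not from the project code; the limit would be 200
in the scheme described above.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads rows line by line and keeps the ones that the
// supplied validity check rejects. Reading stops as soon as `limit` rows
// have been collected or the input is exhausted, whichever comes first.
std::vector<std::string> collectSuspectRows(
    std::istream &in,
    std::size_t limit,
    const std::function<bool(const std::string &)> &isValid)
{
    std::vector<std::string> suspects;
    std::string row;
    while (suspects.size() < limit && std::getline(in, row)) {
        if (!isValid(row)) {
            suspects.push_back(row);
        }
    }
    return suspects;
}
```

Passing the check in as a callback keeps the sampling loop independent of
how validity is actually decided.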

Now this implementation seems to correctly ensure that only valid UTF-8 is
written to the file, but it seems that the majority of the data with
encoding errors just gets replaced by a replacement character that is
itself valid UTF-8, which defeats the purpose. If I skip this step and
choose to remove all fields with encoding errors, it seems to work
perfectly, except on Windows, where badly encoded characters are replaced
with '?'.

Anyways, here is the implementation of my string encoder. This function
receives a row of the .csv file as a std::string.

#include <string>
#include <QByteArray>
#include <QDebug>
#include <QString>
#include <QTextCodec>

inline QString encode_string( const std::string &str, const std::string &encoding )
{
    // Raw bytes of the row, exactly as they were read from the file.
    const QByteArray bytes( str.data(), static_cast<int>( str.size() ) );

    if ( encoding == UTF8 ) {
        // Decode as UTF-8; invalid sequences become U+FFFD.
        return QString::fromUtf8( bytes );
    }

    // All other encodings (ISO88591, ISO88592, WINDOWS1251, WINDOWS1252,
    // SHIFTJIS, EUCKR, EUCJP) are decoded the same way, so look the codec
    // up by name instead of repeating one branch per encoding.
    QTextCodec *codec = QTextCodec::codecForName( encoding.c_str() );
    if ( !codec ) {
        qDebug() << Q_FUNC_INFO << "Hit bad encoding case.";
        return QString();
    }

    return codec->toUnicode( bytes );
}
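
As an illustration of what the conversion step does for one of these
encodings, here is a minimal sketch in plain C++ (no Qt) for ISO-8859-1
only. It relies on the fact that each ISO-8859-1 byte maps to the Unicode
code point with the same value. The name latin1ToUtf8 is made up for this
example and is not part of the code above.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; bytes 0x80..0xFF
// become the two-byte UTF-8 sequence for U+0080..U+00FF.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Every byte value is a valid ISO-8859-1 character in this mapping, so the
conversion itself can never report an error, even when the file was really
written in some other encoding.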

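Here is the implementation of my UTF-8 validity checker:

bool Parser::isUTF8( std::string string )
{
    // Invalid byte sequences decode to U+FFFD, the replacement character.
    // Pass the length explicitly so embedded NUL bytes do not cut the
    // string short.
    const QString utf8str = QString::fromUtf8( string.data(),
                                               static_cast<int>( string.size() ) );

    for ( int i = 0; i < utf8str.length(); i++ ) {
        // Was a comparison against -3, which is 0xFFFD as a 16-bit value.
        // Note that a genuine U+FFFD in the input is also reported here.
        if ( utf8str.at( i ) == QChar::ReplacementCharacter ) {
            return false;
        }
    }

    return true;
}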
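
For comparison, here is a minimal sketch of a check that works directly on
the raw bytes instead of looking for the replacement character after
decoding. It is plain C++ (no Qt), the name isWellFormedUtf8 is made up for
this example, and it is a swapped-in technique rather than what
Parser::isUTF8 does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free function: returns true only if `bytes` is well-formed
// UTF-8. Rejects stray or missing continuation bytes, overlong forms,
// surrogate code points and values above U+10FFFF.
bool isWellFormedUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // continuation bytes expected
        unsigned long min = 0;   // smallest code point for this length
        unsigned long cp = 0;    // decoded code point

        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; min = 0x80; cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; min = 0x800; cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; min = 0x10000; cp = lead & 0x07;
        } else {
            return false;        // stray continuation or invalid lead byte
        }

        if (extra > n - i - 1) {
            return false;        // sequence is cut off at the end
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;    // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                      // overlong form
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate

        i += extra + 1;
    }
    return true;
}
```

Unlike the replacement-character test, this accepts input that legitimately
contains U+FFFD (the bytes EF BF BD).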

And here is my call point:

    //Write utf8 version of the string
    QString encoded = util::encode_string( joined, this->encoding );

    //If the encoded string has UTF8 errors, write it to the encode error file
    if ( !isUTF8( encoded.toStdString() ) ) {
        QFile encode( this->encodeErrFileName );
        encode.open( QIODevice::ReadWrite | QIODevice::Append );
        QTextStream encodeOut( &encode );
        encodeOut << encoded << "\r\n";
        encode.close();
    } else {
        emit cleanRow();
        tmpFileWriter << encoded << "\r\n";
    }

    output.close();

My function calls go like this:

Read file as standard (C++ ifstream, no encoding done) -> encode the string
(using util::encode_string) -> check if encoded string is valid UTF8 (using
bool isUTF8(str)) -> if true, write to output file, if false write to
encoding errors file.
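
The same flow, written as a minimal sketch in plain C++ (no Qt). The names
routeRows and RouteCounts are made up for illustration, and the convert and
isValid callbacks stand in for util::encode_string and isUTF8; none of this
is taken from the project code.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical tally of how many rows went to each destination.
struct RouteCounts {
    std::size_t clean = 0;
    std::size_t errors = 0;
};

// Reads rows from `in`, converts each one, then writes it either to
// `cleanOut` (check passed) or to `errorOut` (check failed).
RouteCounts routeRows(
    std::istream &in,
    std::ostream &cleanOut,
    std::ostream &errorOut,
    const std::function<std::string(const std::string &)> &convert,
    const std::function<bool(const std::string &)> &isValid)
{
    RouteCounts counts;
    std::string row;
    while (std::getline(in, row)) {
        const std::string converted = convert(row);
        if (isValid(converted)) {
            cleanOut << converted << "\r\n";
            ++counts.clean;
        } else {
            errorOut << converted << "\r\n";
            ++counts.errors;
        }
    }
    return counts;
}
```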