[Interest] Code Review For Correcting Encoding Errors

Kyle Neal kyle at verias.com
Thu Apr 2 16:39:41 CEST 2015


First of all, thanks for taking the time to respond.

On Wed, Apr 1, 2015 at 6:00 PM, Thiago Macieira <thiago.macieira at intel.com>
wrote:

> On Wednesday 01 April 2015 17:07:55 Kyle Neal wrote:
> > Here is the implementation of my valid utf8 checker
> >
> >  bool Parser::isUTF8( std::string string )
> >  {
> >     QString utf8str = QString::fromUtf8( string.c_str() );
> >
> >     for ( int i = 0; i < utf8str.length(); i++ ) {
> >         if ( utf8str.at( i ) == -3 ) {
> >             return false;
> >         }
> >
> >        return true;
> >  }
>
> This is wrong. Technically speaking, the source could have the UTF-8
> version
> of the U+FFFD character, in which case it's valid UTF-8 but you'd return
> false.
>
> Instead, use QTextCodec with a stateful decoder and check if the number of
> invalid characters is non-zero.
>
> That is:
>
>     QTextCodec::ConverterState state;
>     QTextCodec *utf8Codec = QTextCodec::codecForMib(106);
>     QString result = utf8Codec->toUnicode(string.c_str(), string.length(),
>         &state);
>     return state.invalidChars
>

Perhaps my understanding was wrong. My understanding of the requirement I
was told was that if there are any UTF8 replacement characters in a field
they should be removed. I was thinking that this is because a replacement
character indicates that it is not valid UTF8? That being said, does this
still deem my implementation to be incorrect?

Using your implementation, invalid chars remain 0 even if a sample row is "
test at example.com,��� is an illegal utf-8 byte sequence" or "test at example.com,no
illegal UTF8 characters". I believe that is what the expected results are
since U+FFFD is valid UTF8.


> I would also recommend that you:
>
>  - don't discard that QString result. Reuse it.
>  - don't use std::string to represent an encoding. A QTextCodec pointer is
> the
>    right way.
>

I agree, this would save me from having to map the string description to an
encoding and allows me to remove that big if block.


>  - use the code above to check any encoding, not just UTF-8
>  - stop using ifstream
>

This is actually something I want to do. But, I have not found a simple
solution to handle '\r' line endings with QTextStream. So because of this,
I use ifstream to read from the source file, and QTextStream everywhere
else. Suggestions on this?



> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel Open Source Technology Center
>
> _______________________________________________
> Interest mailing list
> Interest at qt-project.org
> http://lists.qt-project.org/mailman/listinfo/interest
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.qt-project.org/pipermail/interest/attachments/20150402/89ff084d/attachment.html>


More information about the Interest mailing list