[Interest] Code Review For Correcting Encoding Errors
Kyle Neal
kyle at verias.com
Thu Apr 2 16:39:41 CEST 2015
First of all, thanks for taking the time to respond.
On Wed, Apr 1, 2015 at 6:00 PM, Thiago Macieira <thiago.macieira at intel.com>
wrote:
> On Wednesday 01 April 2015 17:07:55 Kyle Neal wrote:
> > Here is the implementation of my valid utf8 checker
> >
> > bool Parser::isUTF8( std::string string )
> > {
> > QString utf8str = QString::fromUtf8( string.c_str() );
> >
> > for ( int i = 0; i < utf8str.length(); i++ ) {
> > if ( utf8str.at( i ) == -3 ) {
> > return false;
> > }
> >
> > return true;
> > }
>
> This is wrong. Technically speaking, the source could have the UTF-8
> version
> of the U+FFFD character, in which case it's valid UTF-8 but you'd return
> false.
>
> Instead, use QTextCodec with a stateful decoder and check if the number of
> invalid characters is non-zero.
>
> That is:
>
> QTextCodec::ConverterState state;
> QTextCodec *utf8Codec = QTextCodec::codecForMib(106);
> QString result = utf8Codec->toUnicode(string.c_str(), string.length(),
> &state);
> return state.invalidChars
>
Perhaps my understanding was wrong. My understanding of the requirement I
was told was that if there are any UTF8 replacement characters in a field
they should be removed. I was thinking that this is because a replacement
character indicates that it is not valid UTF8? That being said, does this
still deem my implementation to be incorrect?
Using your implementation, invalid chars remain 0 even if a sample row is "
test at example.com,��� is an illegal utf-8 byte sequence" or "test at example.com,no
illegal UTF8 characters". I believe that is what the expected results are
since U+FFFD is valid UTF8.
> I would also recommend that you:
>
> - don't discard that QString result. Reuse it.
> - don't use std::string to represent an encoding. A QTextCodec pointer is
> the
> right way.
>
I agree, this would save me from having to map the string description to an
encoding and allows me to remove that big if block.
> - use the code above to check any encoding, not just UTF-8
> - stop using ifstream
>
This is actually something I want to do. But, I have not found a simple
solution to handle '\r' line endings with QTextStream. So because of this,
I use ifstream to read from the source file, and QTextStream everywhere
else. Suggestions on this?
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
> Software Architect - Intel Open Source Technology Center
>
> _______________________________________________
> Interest mailing list
> Interest at qt-project.org
> http://lists.qt-project.org/mailman/listinfo/interest
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.qt-project.org/pipermail/interest/attachments/20150402/89ff084d/attachment.html>
More information about the Interest
mailing list