[Interest] Code Review For Correcting Encoding Errors

Thiago Macieira thiago.macieira at intel.com
Thu Apr 2 17:47:31 CEST 2015


On Thursday 02 April 2015 10:39:41 Kyle Neal wrote:
> Perhaps my understanding was wrong. My understanding of the requirement I
> was told was that if there are any UTF8 replacement characters in a field
> they should be removed. I was thinking that this is because a replacement
> character indicates that it is not valid UTF8? That being said, does this
> still deem my implementation to be incorrect?

You tell me.

Do you consider a block of text that is otherwise valid but contains the 
sequence EF BF BD to be invalid?

I wouldn't. But if you're going to reject some code points, you should also
reject ALL the noncharacter code points (U+FDD0 to U+FDEF).
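
To make the byte patterns concrete, here is a minimal sketch in plain C++ (no Qt) that counts both kinds of code point in a buffer. It assumes the input has already been validated as well-formed UTF-8, so the byte 0xEF can only be the lead byte of a three-byte sequence. The names SuspectCounts and countSuspectCodePoints are hypothetical and do not come from the code under review. Only the range named above is covered; Unicode also designates U+FFFE, U+FFFF and the last two code points of every other plane as noncharacters, which this sketch ignores.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type, for illustration only.
struct SuspectCounts {
    std::size_t replacementChars = 0; // U+FFFD, encoded as EF BF BD
    std::size_t nonCharacters = 0;    // U+FDD0..U+FDEF, encoded as EF B7 90..AF
};

// Assumption: `utf8` is well-formed UTF-8, so 0xEF only ever starts a
// three-byte sequence and a plain byte scan cannot misalign.
SuspectCounts countSuspectCodePoints(const std::string &utf8)
{
    SuspectCounts counts;
    for (std::size_t i = 0; i + 2 < utf8.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 != 0xEF)
            continue;
        const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
        const unsigned char b2 = static_cast<unsigned char>(utf8[i + 2]);
        if (b1 == 0xBF && b2 == 0xBD)
            ++counts.replacementChars;
        else if (b1 == 0xB7 && b2 >= 0x90 && b2 <= 0xAF)
            ++counts.nonCharacters;
        i += 2; // skip the two continuation bytes just examined
    }
    return counts;
}
```

Whether a non-zero count means "reject the field" is the policy question raised above; the sketch only reports what is there.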

> Using your implementation, invalid chars remain 0 even if a sample row is
> "test@example.com,��� is an illegal utf-8 byte sequence" or
> "test@example.com,no illegal UTF8 characters". I believe that is what the
> expected results are since U+FFFD is valid UTF8.

Right.

> > I would also recommend that you:
> >  - don't discard that QString result. Reuse it.
> >  - don't use std::string to represent an encoding. A QTextCodec pointer is
> >    the right way.
> 
> I agree, this would save me from having to map the string description to an
> encoding and allows me to remove that big if block.
> 
> >  - use the code above to check any encoding, not just UTF-8
> >  - stop using ifstream
> 
> This is actually something I want to do. But, I have not found a simple
> solution to handle '\r' line endings with QTextStream. So because of this,
> I use ifstream to read from the source file, and QTextStream everywhere
> else. Suggestions on this?

Open the QFile in QIODevice::Text mode. And I thought QTextStream handled them
too, automatically even.
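
Should a bare '\r' terminator still not be split by the Qt facilities, one fallback is to split the raw bytes by hand before decoding. The following is a minimal sketch in plain C++, not Qt code and not taken from the code under review. The name splitLinesAnyEnding is hypothetical, and it assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0D and 0x0A never occur inside a multi-byte sequence.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits on "\n", "\r\n", or a lone "\r".
// Assumption: the bytes are in an ASCII-compatible encoding (e.g. UTF-8).
std::vector<std::string> splitLinesAnyEnding(const std::string &text)
{
    std::vector<std::string> lines;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat "\r\n" as a single terminator rather than two.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
            lines.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty())
        lines.push_back(current); // final line without a terminator
    return lines;
}
```

Each resulting line can then be decoded through the codec as before, so the invalid-character check still applies per row.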

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



