[Interest] Code Review For Correcting Encoding Errors
Thiago Macieira
thiago.macieira at intel.com
Thu Apr 2 17:47:31 CEST 2015
On Thursday 02 April 2015 10:39:41 Kyle Neal wrote:
> Perhaps my understanding was wrong. My understanding of the requirement I
> was told was that if there are any UTF8 replacement characters in a field
> they should be removed. I was thinking that this is because a replacement
> character indicates that it is not valid UTF8? That being said, does this
> still deem my implementation to be incorrect?
You tell me.
Do you consider a block of text that is otherwise valid but contains the
sequence EF BF BD to be invalid?
I wouldn't. But if you're going to reject some code points, you should also
reject ALL noncharacter code points (U+FDD0 to U+FDEF).
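
To make the distinction concrete, here is a minimal sketch in standard C++ (no Qt) of the two checks being discussed: spotting the bytes EF BF BD in text that has already been validated as UTF-8, and testing whether a code point is a noncharacter. The function names are made up for illustration. The noncharacter test also covers the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...), which goes beyond the U+FDD0 to U+FDEF range named above.

```cpp
#include <string_view>

// Hypothetical helper: true for Unicode noncharacters, i.e. U+FDD0..U+FDEF
// plus the last two code points of every plane (U+xxFFFE and U+xxFFFF).
constexpr bool isNoncharacter(char32_t cp)
{
    if (cp > 0x10FFFF)
        return false; // outside the Unicode code space
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

// Hypothetical helper: true if the text contains U+FFFD encoded as
// EF BF BD. Assumes the input is already valid UTF-8; in valid UTF-8 a
// match can only start on a character boundary, because EF is a lead byte.
inline bool containsReplacementChar(std::string_view utf8)
{
    return utf8.find("\xEF\xBF\xBD") != std::string_view::npos;
}
```

Note that containsReplacementChar() only reports that U+FFFD is present. It cannot tell whether the character was in the source file all along or was substituted by an earlier, lossy decoding step.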
> Using your implementation, invalid chars remain 0 even if a sample row is
> "test@example.com,��� is an illegal utf-8 byte sequence" or
> "test@example.com,no illegal UTF8 characters". I believe that is what the
> expected results are since U+FFFD is valid UTF8.
Right.
> > I would also recommend that you:
> > - don't discard that QString result. Reuse it.
> > - don't use std::string to represent an encoding. A QTextCodec pointer is
> > the right way.
>
> I agree, this would save me from having to map the string description to an
> encoding and allows me to remove that big if block.
>
> > - use the code above to check any encoding, not just UTF-8
> > - stop using ifstream
>
> This is actually something I want to do. But, I have not found a simple
> solution to handle '\r' line endings with QTextStream. So because of this,
> I use ifstream to read from the source file, and QTextStream everywhere
> else. Suggestions on this?
Open the QFile in QIODevice::Text mode. And I thought QTextStream handled them
too, automatically even.
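
For reference, this is roughly what "handling '\r' line endings" amounts to, sketched in standard C++ on an in-memory buffer rather than with any Qt class. splitLines() is a hypothetical helper, not a Qt API, and it is not meant to describe how QFile or QTextStream work internally. It treats "\n", "\r\n" and a lone "\r" as line terminators.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits text into lines, accepting "\n", "\r\n",
// or a lone "\r" as the terminator. Terminators are not kept.
inline std::vector<std::string> splitLines(std::string_view text)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c != '\r' && c != '\n')
            continue;
        lines.emplace_back(text.substr(start, i - start));
        // Treat "\r\n" as a single terminator.
        if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            ++i;
        start = i + 1;
    }
    // Keep a final line that has no terminator.
    if (start < text.size())
        lines.emplace_back(text.substr(start));
    return lines;
}
```

If a file really does use lone '\r' as its terminator, it is worth testing the Qt route against such a file before relying on it.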
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center