[Interest] Code Review For Correcting Encoding Errors

Kyle Neal kyle at verias.com
Thu Apr 2 18:09:45 CEST 2015


On Thu, Apr 2, 2015 at 11:47 AM, Thiago Macieira <thiago.macieira at intel.com>
wrote:

> On Thursday 02 April 2015 10:39:41 Kyle Neal wrote:
> > Perhaps my understanding was wrong. My understanding of the requirement I
> > was told was that if there are any UTF8 replacement characters in a field
> > they should be removed. I was thinking that this is because a replacement
> > character indicates that it is not valid UTF8? That being said, does this
> > still deem my implementation to be incorrect?
>
> You tell me.
>
> Do you consider a block of text that is otherwise valid but contains the
> sequence EF BF BD to be invalid?
>
> I wouldn't. But if you're going to reject some code points, you should
> reject ALL code points that point to non-characters too (U+FDD0 to U+FDEF).
>

Excellent point.
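
For reference, a minimal sketch of such a check in plain C++ (not Qt API).
The function name is made up for illustration. Besides the U+FDD0 to U+FDEF
range mentioned above, it also covers the last two code points of every
plane (U+FFFE, U+FFFF, U+1FFFE, ...), which Unicode likewise defines as
noncharacters; that part is my addition to the range quoted above.

```cpp
#include <cstdint>

// Hypothetical helper: true for the 66 Unicode noncharacters, i.e.
// U+FDD0..U+FDEF plus U+xxFFFE and U+xxFFFF at the end of each plane.
// Note that U+FFFD (the replacement character) is NOT a noncharacter.
constexpr bool isNoncharacter(std::uint32_t cp)
{
    return cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE);
}
```

With this, a filter that rejects code points would treat U+FDD0 and U+FFFE
as rejectable while leaving U+FFFD alone, unless the requirement really is
to strip replacement characters as well.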


>
> > Using your implementation, invalid chars remain 0 even if a sample row
> > is "test at example.com,��� is an illegal utf-8 byte sequence" or
> > "test at example.com,no illegal UTF8 characters". I believe that is what
> > the expected results are since U+FFFD is valid UTF8.
>
> Right.
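
To make the distinction concrete, here is a rough sketch of a strict UTF-8
well-formedness check in plain C++. It is not the QTextCodec-based code
discussed in this thread, and the function name is hypothetical. The point
it illustrates: the bytes EF BF BD decode to U+FFFD and are well-formed,
whereas a stray 0xFF or an overlong sequence is not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects stray/missing continuation bytes, overlong encodings,
// surrogates (U+D800..U+DFFF) and values above U+10FFFF.
bool isWellFormedUtf8(const std::string &bytes)
{
    const auto *p = reinterpret_cast<const unsigned char *>(bytes.data());
    const std::size_t n = bytes.size();
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long minForLen[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;               // continuation byte or 0xF8..0xFF as lead
        if (len > n - i) return false;   // sequence truncated at end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minForLen[len]) return false;            // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

A field containing the literal sequence EF BF BD passes this check, so
counting "invalid" characters and stripping U+FFFD are two separate
decisions.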
>
> > > I would also recommend that you:
> > >  - don't discard that QString result. Reuse it.
> > >  - don't use std::string to represent an encoding. A QTextCodec pointer
> > >    is the right way.
> >
> > I agree, this would save me from having to map the string description to
> > an encoding and allows me to remove that big if block.
> >
> > >  - use the code above to check any encoding, not just UTF-8
> > >  - stop using ifstream
> >
> > This is actually something I want to do. But, I have not found a simple
> > solution to handle '\r' line endings with QTextStream. So because of
> > this, I use ifstream to read from the source file, and QTextStream
> > everywhere else. Suggestions on this?
>
> Open the QFile in QIODevice::Text mode. And I thought QTextStream handled
> them too, automatically even.
>

Unfortunately, opening the QFile with the QIODevice::Text flag badly breaks
my program on Windows: in that mode the "\r\n" line endings are translated
to "\n" on read, so the file positions I track end up way off. Also,
QTextStream does not handle '\r' automatically. If I read a CSV file with
'\r' line endings using QTextStream::readLine(), it reads the entire file
into one QString. I remember this being reported as a bug that was rejected
as "out of scope". In the discussion of that bug, a Qt developer says that
QTextStream will never handle '\r', claiming that Qt is not used on OSes
that use '\r' line endings. You can see this here:
https://bugreports.qt.io/browse/QTBUG-18038
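
For what it's worth, the kind of workaround I mean looks roughly like the
sketch below: a line reader over a std::istream that accepts "\n", "\r\n",
or a lone '\r' as the terminator. This is plain C++, not Qt API, and
readAnyLine / splitLines are names invented for this example. It assumes
the stream was opened in binary mode (std::ios::binary), so that no line
ending translation happens and stream positions match byte offsets.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one line into `line`, treating "\n", "\r\n"
// and a lone "\r" as line terminators. Returns false once no data is left.
bool readAnyLine(std::istream &in, std::string &line)
{
    using traits = std::char_traits<char>;
    line.clear();
    std::streambuf *sb = in.rdbuf();
    for (;;) {
        const traits::int_type c = sb->sbumpc();
        if (traits::eq_int_type(c, traits::eof())) {
            in.setstate(std::ios::eofbit);
            if (line.empty()) {
                in.setstate(std::ios::failbit);
                return false;          // nothing read: end of input
            }
            return true;               // last line had no terminator
        }
        if (c == '\n')
            return true;
        if (c == '\r') {
            // Swallow the '\n' of a "\r\n" pair, if there is one.
            if (sb->sgetc() == '\n')
                sb->sbumpc();
            return true;
        }
        line.push_back(traits::to_char_type(c));
    }
}

// Small usage helper: splits an in-memory buffer into lines.
std::vector<std::string> splitLines(const std::string &data)
{
    std::istringstream in(data);
    std::vector<std::string> lines;
    std::string line;
    while (readAnyLine(in, line))
        lines.push_back(line);
    return lines;
}
```

The resulting raw bytes of each line could then be decoded with whatever
codec applies, which keeps the line splitting independent of both
QTextStream and the platform's text mode.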


>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel Open Source Technology Center