[Qt-interest] Messed-up filename encoding (was: Messed-up encoding to UTF-8)

Jeffrey Brendecke jwbrendecke at icanetix.com
Sun Jul 19 03:06:17 CEST 2009


On Saturday 18 July 2009 18:22:54 you wrote:

>
> There's nothing wrong with the Qt encoding.
>
> Your source is already encoded UTF-8 and you decoded it as Latin1 when
> reading into QString. So when you call toUtf8(), it doubly encodes.
>
> Please check your source data.

===========
(Sending again using the email address I have registered with the mailing
list.)

(I updated the subject line to reflect the problem better.)

I took another look at the source. The QString that is later converted to an
8-bit encoding comes from a traversal of an ext3 file system tree (Kubuntu
Linux) driven through QDir::entryInfoList(). The filename is taken from the
QFileInfo entries in the resulting QFileInfoList. From that point on, the
string is kept as a QString until it is encoded to UTF-8 later on. You are
correct: what I reported earlier has nothing to do with QTextEncoder.

The filename in question is "SAMEFILENÄMECHANGE.txt".

If I take a look at the filename itself in the shell, I see this:

$ ls -1 SAMEFILENÄMECHANGE.txt | xxd
0000000: 5341 4d45 4649 4c45 4ec3 844d 4543 4841  SAMEFILEN..MECHA
0000010: 4e47 452e 7478 740a                      NGE.txt.

At the point in question, I do see the sequence c3 84, but that is to be
expected: the locale is en_US.UTF-8, and c3 84 is the UTF-8 encoding of
"Ä" (U+00C4).

So, I hacked the code to dump the values of the 16-bit elements of the
original string, and I am finding what you hinted at: a double encoding.

I used this code to dump the values:

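// Dump every element of the string as a hex value.
// Needs <iostream>, <string> and <QString>.
const std::wstring spW = fileName.toStdWString();
for ( std::size_t i = 0, count = spW.size(); i < count; ++i )
{
  // Truncating to 16 bits is fine here: all characters are in the BMP.
  const quint16 val = static_cast<quint16>( spW[i] );
  const QString outStr = QString( "%1 " ).arg( val, 0, 16 );
  std::cerr << outStr.toLocal8Bit().constData();
}
std::cerr << std::endl;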

Now I see the problem: the two UTF-8 bytes show up as two separate 16-bit
elements of the string:

S  A  M  E  F  I  L  E  N  Ä     M  E  C  H  A  N  G  E  .  t  x  t
53 41 4d 45 46 49 4c 45 4e c3 84 4d 45 43 48 41 4e 47 45 2e 74 78 74

I would have expected a single "c4" instead of "c3" followed by "84".
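
To make the difference concrete, here is a minimal sketch in plain C++ (no Qt)
of the two ways the bytes from the file system could be turned into 16-bit
elements. The helper names decodeLatin1 and decodeUtf8 are made up for this
illustration and are not Qt functions; the UTF-8 decoder assumes well-formed
input and only covers 1- to 3-byte sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: treat every byte as one Latin-1 character,
// so each byte becomes its own 16-bit element.
std::vector<std::uint16_t> decodeLatin1(const std::string& bytes)
{
    std::vector<std::uint16_t> out;
    for (unsigned char b : bytes)
        out.push_back(b);
    return out;
}

// Hypothetical helper: minimal UTF-8 decoder for 1- to 3-byte sequences
// (enough for the BMP). Assumes well-formed input; anything it cannot
// handle is replaced by U+FFFD.
std::vector<std::uint16_t> decodeUtf8(const std::string& bytes)
{
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {                                   // 1 byte: ASCII
            out.push_back(static_cast<std::uint16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()) {  // 2 bytes
            const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
            out.push_back(static_cast<std::uint16_t>(
                ((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < bytes.size()) {  // 3 bytes
            const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
            const unsigned b2 = static_cast<unsigned char>(bytes[i + 2]);
            out.push_back(static_cast<std::uint16_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        } else {                                           // not handled here
            out.push_back(0xFFFD);
            i += 1;
        }
    }
    return out;
}
```

With the bytes c3 84 as input, decodeLatin1 returns the two elements c3 and 84
(what I am seeing), while decodeUtf8 returns the single element c4 (what I
expected).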

I even tried manually creating another file in the shell on the same computer
with the character "Ä" in its name, and it shows the same problem when the
contents of the directory are listed as described above.

It seems that Qt should have decoded the 8-bit filename correctly to Unicode,
the way iconv does:

$ printf '\xc3\x84' | iconv -f utf-8 -t UNICODE | xxd
0000000: fffe c400                                ....

$ printf '\xff\xfe\xc4\x00' | iconv -f UNICODE -t utf-8
Ä
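
The same round trip also shows where the double encoding comes from once the
string is converted to UTF-8 later on. The following is only a sketch in plain
C++ (no Qt); encodeUtf8 is a made-up helper name, and it assumes that all
elements are BMP characters without surrogate pairs.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode 16-bit elements as UTF-8.
// Assumes BMP characters only (no surrogate pairs).
std::string encodeUtf8(const std::vector<std::uint16_t>& units)
{
    std::string out;
    for (const std::uint16_t u : units) {
        if (u < 0x80) {                        // 1 byte: ASCII
            out.push_back(static_cast<char>(u));
        } else if (u < 0x800) {                // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else {                               // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (u >> 12)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}
```

Encoding the single element c4 gives the bytes c3 84, which match the name on
disk. Encoding the two elements c3 and 84 gives c3 83 c2 84, that is, the
doubly encoded form.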

Am I missing something really fundamental? 



