[Development] Deprecating QFile::encodeName/decodeName

Wed Jun 6 00:47:23 CEST 2012

On Tuesday, 5 June 2012 16.48.58, João Abecasis wrote:
> Thiago Macieira wrote:
> > On Tuesday, 5 June 2012 13.20.36, João Abecasis wrote:
> >> I would not go as far as saying that the concept is broken, but it
> >> definitely isn't portable. One big issue is that nowadays we don't
> >> use an 8-bit encoding on Windows and those functions assume QString
> >> <-> QByteArray conversions.
> > 
> > I called it broken due to the ambiguity when it comes to interpreting
> > data that may or may not contain file names, like command-line
> > arguments. See the next paragraph of what I had written.
> > 
> > Codecs must never be used in a situation where there's an ambiguity on
> > what the encoding is. That will lead to mojibake sooner or later.
> 
> That's the whole point, you shouldn't treat file names differently when
> processing the command line or configuration files. Those should be
> interpreted using their specific rules: command line goes according to
> locale, configuration goes according to established convention (say,
> UTF-8).

So you're asking that filenames be passed in the locale encoding (say, UTF-8) 
on the command-line, regardless of what the filesystem encoding is?

That is a possible solution, but it requires that all tools be taught to do 
that, including the ones that currently couldn't care less about encodings. 
Many shell tools assume that the file name can be copied verbatim from the 
directory entry (struct dirent, from readdir(3)) to the screen.

I don't see that happening any more than I see any of the other solutions. The 
only pragmatic solution is to enforce filesystem encoding == locale encoding.

In fact, there is one more possible solution which stands a chance: forcing 
the problem onto the kernel. Make the entire userspace API be UTF-8 and have 
the kernel recode to the filesystem encoding as necessary. The problem with 
this solution is that it a) will suffer extreme resistance from kernel 
developers and other people who think of file names as "binary data" instead of 
human-readable text; and b) is no different from the other solution of 
enforcing the encoding.

[snip]
>     QtGit commit -m $'R\xc3\xa9sum\xc3\xa9' 'R%E9sum%E9.txt'
> 
> Bash enables you to pass around byte sequences it doesn't understand.
> That would help in getting UTF-8 into the application but would not help
> pass in Latin1, as the application would (correctly) flag it as an
> invalid UTF-8 sequence and replace 'é' with '?'. Instead, we need to
> escape the file name in a way that will not be misinterpreted by bash or
> command line processing, giving code that actually expects a path (say,
> QFile, encodeName or QFileSystemEntry) a chance to interpret it.
> 
> For this to work, command line parsing need not understand more than
> UTF-8, but encodeName would then be able to convert those \xHH sequences
> to the proper Latin1-encoded "Résumé" and retrieve the right file from
> the file system.

You used %HH instead of \xHH in your example, which is the URL encoding and 
for which there are well-defined rules. I'd prefer that.
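
To make the %HH form concrete, here is a rough sketch in Python rather than Qt. It uses only the standard urllib.parse helpers; nothing in it is an existing QFile API. It shows the Latin1 bytes of the name surviving a trip through plain ASCII:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# The on-disk name: "Résumé.txt" encoded in Latin1, so each é is the byte 0xE9.
raw_name = b"R\xe9sum\xe9.txt"

# Percent-encode every byte outside the unreserved ASCII set.
escaped = quote_from_bytes(raw_name)
print(escaped)  # R%E9sum%E9.txt

# Decoding returns the exact original bytes, whatever the locale is.
assert unquote_to_bytes(escaped) == raw_name
```

The escaped form is pure ASCII, so it passes through any locale codec unchanged. The price is that the receiving side has to know it must be unescaped.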

> We're talking about the edge cases, right? 99% of uses will work with or
> without the need for encode/decodeName.
> 
> Currently, we don't support the edge cases at all. If we were able to
> support the edge cases but not be able to interoperate with other
> applications on those cases it would already be an improvement.

What we don't currently support are file names that cannot be decoded by the 
locale codec, such as a latin1("Résumé.txt") on a UTF-8 system. If that's what 
you mean by edge cases, then that's what I meant.
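
A small Python sketch of that edge case, with a strict UTF-8 decode standing in for the locale codec (an assumption for illustration; Qt's codecs are not involved here):

```python
# "Résumé.txt" as written by a Latin1 system: each é is the single byte 0xE9.
raw_name = b"R\xe9sum\xe9.txt"

# A strict UTF-8 decode rejects the name outright.
try:
    raw_name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# A lenient decode substitutes U+FFFD for each bad byte, after which the
# original bytes can no longer be recovered from the text.
lossy = raw_name.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")

assert not decodable
assert round_trip != raw_name
```

Once the replacement character is in the string, there is no way to get back to the name the filesystem actually holds, so the file cannot be opened, renamed or deleted through that string.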

What I cannot agree with is sacrificing the normal case for the edge case. It 
might be possible to fix the problem of the edge case above (a restricted 
issue) with little fall-out, but I don't see a way of solving the generic 
problem of filesystem encoding != locale encoding.

> > The way I see it, the only way that this could work is if two
> > conditions were met:
> > 
> > 1) there is a cross-toolkit, lossless and encoding-independent
> > representation of a filepath
> 
> This would be nice, agreed.

I'd say it's mandatory. If we only try to solve the restricted problem of 
filenames outside the filesystem encoding (which is equal to the locale 
encoding), there's a possible solution similar to Qt3's.

If we want to solve the generic problem, then we must always transmit data in 
a specific encoding between applications. That requires, in turn, that all 
applications be modified to understand that for the common case (not just the 
edge case).

> > [Desktop Entry]
> > Encoding=UTF-8
> > # the third argument to git commit is a URI reference for a filename
> > # called "Résumé.txt" encoded in Latin 1
> > Exec=git commit -m Résumé R%E9sum%E9.txt
> > 
> > without modifications, git would not understand that representation
> > and would not find the file on disk. Changing *every* *single*
> > application under the sun to use a different representation for file
> > paths than what they do today is not feasible.
> 
> What a lot of applications will do is treat byte-sequences (strings)
> agnostically and not validate them as valid UTF-8 sequences, allowing
> you to still access files with those "funny" names.

Only if the calling application or the user wrote those "funny names" 
literally, in the format that the filesystem OS functions expect. That is not 
the case above, where we used a different, specific encoding.

> While I agree with a lot of what you are saying about what is reasonable
> and feasible for encodings of file names (heck, everyone should be using
> UTF-8 or shot on the spot!), Qt as a toolkit should not lock you out of
> files that mistakenly or not ended up with those "funny" names. One
> consequence of using locale to decide how to encode file names is that
> it is all too easy to come across those files (e-mail, file sharing, USB
> sticks and whatnot).
> 
> Again, as a general purpose toolkit Qt needs to allow you to read,
> rename and delete those files. It should potentially allow you to store
> their names and come back to them at a later time.

Again, we're mixing the general problem with the edge case.

There is a possible solution for the edge case, implementable without 
sacrificing the normal case, and which would cause little fall-out in terms of 
interoperability. But I really, really do not see how we can solve the generic 
case of allowing the user to change the filesystem encoding at will. And by 
user here, I mean both the developer using Qt by calling 
QFile::setEncodingFunction as well as the end user toggling some configuration 
switches in the system.

To solve the edge case, we need to somehow store in a regular QString a byte 
sequence that can be converted back to its original 8-bit form, regardless of 
the locale encoding being used. If we say that changing the codecs themselves 
in QTextCodec and QString is out of the question (that was the Qt 3 solution), 
then the only place remaining is QFile::encodeName and QFile::decodeName. By 
necessity, they must do more than just QString::{to,from}Local8Bit.

From there, we come to the conclusion that the QString representing such a file 
name must contain special processing instructions (e.g., one or more special 
characters). One form of special processing instruction is escaping each 
character, like URLs do. The problem with the approach of escaping is what to 
do when the escape character occurs in a file name. If that is a possibility, 
the escape character needs to be escaped by itself (like "\\" for backslashes 
in C or "%25" for percents in URLs). If we use this approach, then we will not 
interoperate properly with non-Qt applications whenever this character occurs 
in a file name.
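
The interoperability problem can be shown with percent-escaping as the example scheme. This is a Python sketch using urllib.parse, and the file name is made up for illustration:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# An ordinary ASCII file name that happens to contain the escape character.
raw_name = b"100%.txt"

# The escaping side has to escape the percent sign itself.
escaped = quote_from_bytes(raw_name)
print(escaped)  # 100%25.txt

# A tool that knows the convention recovers the real name...
assert unquote_to_bytes(escaped) == raw_name

# ...but a tool that hands the string to the filesystem verbatim would look
# for a different file, one literally named "100%25.txt".
verbatim = escaped.encode("ascii")
assert verbatim != raw_name
```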

The only sane solution, then, is to use a character that has a very small 
chance of ever being used or, better yet, a zero chance (I don't think there's 
any). If we pick such a character, it will be close to "untypeable" on 
the terminal. Not a big loss, I'd say.

In fact, I'd recommend that, instead of escaping each bad character, we escape 
each path component (the escaped sequence ends at the next slash). That is, 
suppose I am a Greek user and I unpacked a bad .zip file in the "My Documents" 
folder, which is called:

	/home/foo/έγγραφα

If my file was called "Résumé.txt" in Latin1, the QString representing such a 
file name would be:
	/home/foo/έγγραφα/<escape>Résumé.txt

If it was named "βιογραφικό σημείωμα.txt" in ISO-8859-7, the QString 
representation would be:
	/home/foo/έγγραφα/<escape>âéïãñáöéêü óçìåßùìá.txt

That has the drawback of being hard to use when it comes to path manipulation. 
Appending, prepending, extracting or inserting text could have unexpected 
consequences.
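
The per-component escaping described above could look roughly like the following Python sketch. The ESCAPE character (U+FDD0) and the two function names are placeholders chosen for illustration, a UTF-8 locale is assumed, and none of this is the actual behaviour of QFile::encodeName or QFile::decodeName:

```python
ESCAPE = "\ufdd0"  # hypothetical marker; the thread does not pick a character

def decode_name(raw: bytes) -> str:
    """Filesystem bytes -> text, assuming a UTF-8 locale."""
    parts = []
    for component in raw.split(b"/"):
        try:
            parts.append(component.decode("utf-8"))
        except UnicodeDecodeError:
            # Not valid in the locale encoding: keep each byte as one code
            # point (a Latin1 decode) and mark the whole component.
            parts.append(ESCAPE + component.decode("latin-1"))
    return "/".join(parts)

def encode_name(name: str) -> bytes:
    """Text -> filesystem bytes, undoing decode_name."""
    parts = []
    for component in name.split("/"):
        if component.startswith(ESCAPE):
            parts.append(component[len(ESCAPE):].encode("latin-1"))
        else:
            parts.append(component.encode("utf-8"))
    return b"/".join(parts)

directory = "/home/foo/έγγραφα/".encode("utf-8")
raw = directory + b"R\xe9sum\xe9.txt"  # Latin1 "Résumé.txt"
name = decode_name(raw)
assert name.endswith(ESCAPE + "R\u00e9sum\u00e9.txt")
assert encode_name(name) == raw
```

The drawback is visible here too. If a Greek suffix is appended to the escaped component, encode_name in this sketch raises an error, because the component is no longer representable as Latin1.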

An intermediate option would be to escape each sequence of non-locale 
characters. Instead of the escaping ending at the slash, an unescape character 
is necessary. For simplicity, let's say ⟪ shifts and ⟫ unshifts, the file could 
be:

	/home/foo/έγγραφα/⟪biocqavij|⟫ ⟪sgle_yla⟫.txt
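
For what it's worth, the example above is what you get by clearing the high bit of each ISO-8859-7 byte inside the shifted runs. A Python sketch under that assumption follows. The function names are placeholders, a UTF-8 locale is assumed, and Python's "surrogateescape" error handler is used only as a convenient way to find the undecodable bytes:

```python
import re

SHIFT, UNSHIFT = "\u27ea", "\u27eb"  # the ⟪ and ⟫ used in the example

def decode_name(raw: bytes) -> str:
    """Filesystem bytes -> text, shifting runs of undecodable bytes."""
    # Undecodable bytes 0x80..0xFF come out as lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")

    def shift(match):
        # Clear the high bit so the run becomes plain ASCII.
        low = "".join(chr((ord(ch) - 0xDC00) & 0x7F) for ch in match.group())
        return SHIFT + low + UNSHIFT

    return re.sub("[\udc80-\udcff]+", shift, text)

def encode_name(name: str) -> bytes:
    """Text -> filesystem bytes, restoring the high bit in shifted runs."""
    out = bytearray()
    pos = 0
    for match in re.finditer(SHIFT + "(.*?)" + UNSHIFT, name, re.S):
        out += name[pos:match.start()].encode("utf-8")
        out += bytes(ord(ch) | 0x80 for ch in match.group(1))
        pos = match.end()
    out += name[pos:].encode("utf-8")
    return bytes(out)

directory = "/home/foo/έγγραφα/"
# "βιογραφικό σημείωμα.txt" encoded in ISO-8859-7
greek = bytes.fromhex("e2e9efe3f1e1f6e9eafc20f3e7ece5dff9ece1") + b".txt"
raw = directory.encode("utf-8") + greek
name = decode_name(raw)
print(name)  # /home/foo/έγγραφα/⟪biocqavij|⟫ ⟪sgle_yla⟫.txt
assert encode_name(name) == raw
```

This is only a sketch: clearing the high bit can produce control characters or a "/" inside a shifted run (byte 0xAF becomes 0x2F), which a real implementation would have to handle.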

Pros:
 - implementable, with little fall-out for interoperability if we choose the 
escape character well
 - probably survives a round-trip through the locale codec, so the escaping 
isn't lost
 - since it survives the round-trip, it can be used across QProcess, on the 
command-line, etc.
 - survives the user too, since it can be copy & pasted, edited, provided that 
the escape characters remain

Limitations:
 a) Qt-only, I don't expect anyone else to use such file names
 b) if encodeName() isn't used properly, it leads to a bad encoding of the file 
name into 8-bit. Applications dealing with the filesystem need to be extra 
careful so as to not show two representations of the same file.
 c) for that matter, it's possible to produce an escaped form that matches a 
regular file name
 d) double representations are often a source of security issues if not 
handled carefully (cf. overlong sequences in UTF-8)
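
Limitations (c) and (d) can be demonstrated with a deliberately minimal Python sketch of an escaping decodeName/encodeName pair. The ESCAPE character and the function names are hypothetical, and a UTF-8 locale is assumed:

```python
ESCAPE = "\ufdd0"  # hypothetical escape character, chosen only for this sketch

def decode_name(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable in the locale: mark it and keep the bytes as Latin1.
        return ESCAPE + raw.decode("latin-1")

def encode_name(name: str) -> bytes:
    if name.startswith(ESCAPE):
        return name[len(ESCAPE):].encode("latin-1")
    return name.encode("utf-8")

# File 1: a Latin1-encoded "Résumé.txt", which is invalid as UTF-8.
latin1_file = b"R\xe9sum\xe9.txt"
# File 2: a valid UTF-8 name that literally starts with the escape character.
lookalike = (ESCAPE + "R\u00e9sum\u00e9.txt").encode("utf-8")

# Two different names on disk...
assert latin1_file != lookalike
# ...end up with the same text representation...
assert decode_name(latin1_file) == decode_name(lookalike)
# ...so the second file can no longer be reached through that text.
assert encode_name(decode_name(lookalike)) == latin1_file
```

How likely the collision is depends entirely on the escape character chosen. The sketch only shows that the collision exists in principle.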

As you can see, I didn't come up with this today. I've known these 
alternatives for years. I don't think they're worth our time.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden