[Development] Deprecating QFile::encodeName/decodeName

João Abecasis joao.abecasis at nokia.com
Wed Jun 6 16:51:14 CEST 2012


Thiago Macieira wrote:
> On Tuesday, 5 June 2012 16.48.58, João Abecasis wrote:
>> Thiago Macieira wrote:
>>> On Tuesday, 5 June 2012 13.20.36, João Abecasis wrote:
>>>> I would not go as far as saying that the concept is broken, but it
>>>> definitely isn't portable. One big issue is that nowadays we don't
>>>> use an 8-bit encoding on Windows and those functions assume QString
>>>> <-> QByteArray conversions.
>>> 
>>> I called it broken due to the ambiguity when it comes to interpreting
>>> data that may or may not contain file names, like command-line
>>> arguments. See the next paragraph of what I had written.
>>> 
>>> Codecs must never be used in a situation where there's an ambiguity on
>>> what the encoding is. That will lead to mojibake sooner or later.
>> 
>> That's the whole point, you shouldn't treat file names differently when
>> processing the command line or configuration files. Those should be
>> interpreted using their specific rules: command line goes according to
>> locale, configuration goes according to established convention (say,
>> UTF-8).
> 
> So you're asking that filenames be passed on the locale encoding (say, UTF-8) 
> on the command-line, regardless of what the filesystem encoding is?

I see no other sane way, unless your application is able to take the byte sequences it gets without additional processing. In Qt, strings are 16-bit, so 8-bit strings typically need conversion.

> That is a possible solution, but it requires that all tools be taught to do that, 
> including the ones that currently couldn't care less about encodings. Many 
> shell tools assume that the file name can be copied verbatim from the DIR* 
> entry (from readdir(2)) to the screen.

I don't think phrasing the problem as "require all tools [to] be taught" a new trick really helps. We're obviously not going to be able to force every other tool out there to change. Not any time soon, anyway. Still, some of the same tools you want to be taught a new trick don't have a problem to begin with, as they're quite capable of reading and writing files with "funny" names.

Tools that do all processing in 8-bit can just wait for system calls to error out; they don't really need to interpret encodings the same way Qt does.

Qt's problem is that it can show files with "funny" names in file dialogs and through QDir(Iterator), yet in practice most attempts to interact with those files will fail. Again, encode/decodeName enabled application developers to handle these issues.

> I don't see that happening any more than I see any of the other solutions. The 
> only pragmatic solution is to enforce filesystem encoding == locale encoding.

Locales change, either between users or because some users switch locales. Unless you can enforce that all locales use the same encoding (nowadays much less of a problem, granted), using the user's locale is just wrong. Even if not on the same computer, USB sticks are still a source of files with "funny" names.

> In fact, there is one more possible solution which stands a chance: forcing 
> the problem onto the kernel. Make the entire userspace API be UTF-8 and have 
> the kernel recode to the filesystem encoding as necessary. The problem with 
> this solution is that it a) will suffer extreme resistance from kernel 
> developers and other people who think of file names as "binary data" instead of 
> human-readable text; and b) is no different from the other solution of 
> enforcing the encoding.

Forcing this onto the Linux kernel would in the long term make the situation better for Linux users that don't receive files from any other OSs or kernel versions. It doesn't help everyone.

> [snip]
>>    QtGit commit -m $'R\xc3\xa9sum\xc3\xa9' 'R%E9sum%E9.txt'
>> 
>> Bash enables you to pass around byte sequences it doesn't understand.
>> That would help in getting UTF-8 into the application but would not help
>> pass in Latin1, as the application would (correctly) flag it as an
>> invalid UTF-8 sequence and replace 'é' with '?'. Instead, we need to
>> escape the file name in a way that will not be misinterpreted by bash or
>> command line processing, giving code that actually expects a path (say,
>> QFile, encodeName or QFileSystemEntry) a chance to interpret it.
>> 
>> For this to work, command line parsing need not understand more than
>> UTF-8, but encodeName would then be able to convert those \xHH sequences
>> to the proper Latin1-encoded "Résumé" and retrieve the right file from
>> the file system.
> 
> You used %HH instead of \xHH in your example, which is the URL encoding and 
> for which there are well-defined rules. I'd prefer that.

Indeed, my mistake mixing the two. That's what I get for going back and partially editing the e-mail ;-)

For the record, this was meant as a simple contrived example, not to propose a final solution. If escaping is something we'd want to support we would need to spend some time looking at it.

>> We're talking about the edge cases, right? 99% of uses will work with or
>> without the need for encode/decodeName.
>> 
>> Currently, we don't support the edge cases at all. If we were able to
>> support the edge cases but not be able to interoperate with other
>> applications on those cases it would already be an improvement.
> 
> What we don't currently support are file names that cannot be decoded by the 
> locale codec, such as a latin1("Résumé.txt") on a UTF-8 system. If that's what 
> you mean by edge cases, then that's what I meant.
> 
> What I cannot agree with is sacrificing the normal case for the edge case. It 
> might be possible to fix the problem of the edge case above (a restricted 
> issue) with little fall-out, but I don't see a way of solving the generic 
> problem of filesystem encoding != locale encoding.

Assuming we can find a reversible non-intrusive way of escaping file names (I know, it is not a given that we can), normal cases wouldn't be affected. I think we can also consider slightly intrusive escaping as long as the trade-off to normal cases is acceptable.

>>> The way I see it, the only way that this could work is if two
>>> conditions were met:
>>> 
>>> 1) there is a cross-toolkit, lossless and encoding-independent
>>> representation of a filepath
>> 
>> This would be nice, agreed.
> 
> I'd say it's mandatory. If we only try to solve the restricted problem of 
> filenames outside the filesystem encoding (which is equal to the locale 
> encoding), there's a possible solution similar to Qt3's.

I don't understand why you equate (re-)introducing the ability to handle "funny" paths in Qt to fixing the world. The former can be part of the latter, but they're not the same problem. Heck, some of this is a self-inflicted problem that only exists in Qt because of the need to convert 8-bit to 16-bit encodings, and because we validate all UTF-8 byte sequences by default -- I'm not saying this is wrong, I don't think it is, but not all applications operate under the same constraints.
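
To make the validation point concrete, here is a minimal sketch in plain C++ (standard library only, not Qt's actual decoder) of the kind of strict UTF-8 check that makes a Latin1-encoded "Résumé" undecodable. The function name isValidUtf8 is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong forms,
// surrogates and code points above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII
            ++i;
            continue;
        }
        std::size_t extra;          // number of continuation bytes expected
        unsigned long cp;           // code point being assembled
        unsigned long min;          // smallest code point allowed at this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;          // stray continuation byte or invalid lead
        if (n - i <= extra)
            return false;           // sequence is cut off at the end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;           // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

With this check, the UTF-8 bytes "R\xC3\xA9sum\xC3\xA9" pass, while the Latin1 bytes "R\xE9sum\xE9" fail: 0xE9 announces a three-byte sequence that never follows. A tool that works purely on bytes never runs such a check, which is why it has no problem with the same file.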

> If we want to solve the generic problem, then we must always transmit data in 
> a specific encoding between applications. That requires, in turn, that all 
> applications be modified to understand that for the common case (not just the 
> edge case).

Here's my wishlist:

    1) Read "funny" names from file system
    2) Present said "funny" names to user (may have unrecognized characters replaced with '?')
    3) Operate on files with such "funny" names
    4) Usefully pass said "funny" file names between different Qt APIs
    5) Save "funny" names to disk or configuration
    6) Read them back

For a long time in the Qt 4.x series we've had 1) and 2) through QDirIterator. You'd know a file exists and not be able to do much about it.

With QFileSystemEntry we now support 3) and can support 4) as long as the paths are not "downgraded" to QStrings and passed around as such. We don't necessarily need QFileSystemEntry to become public API, as long as QFile, QFileInfo, QDir, QDirIterator seamlessly interoperate (say, with conversion operators/constructors).

For 5) and 6) we need a non-intrusive reversible escaping mechanism. Non-intrusive so "normal" file names require no special handling and will Just Work(tm). %-encoding sounds interesting or we could get creative, say a triple-slash marks an encoded path segment: /home/joao///R%E9sum%E9.txt.
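
As a rough sketch of the triple-slash idea, in plain C++ rather than Qt: the code below assumes that "///" means "the next path segment, up to the following slash, is %-encoded" and that the marker itself collapses to a single '/'. Both rules, and the names percentDecode and toFileSystemBytes, are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Value of one hexadecimal digit, or -1 if `c` is not one.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces each well-formed %HH in `segment` with the byte it names;
// anything else is copied through unchanged.
static std::string percentDecode(const std::string &segment)
{
    std::string out;
    for (std::size_t i = 0; i < segment.size(); ++i) {
        if (segment[i] == '%' && i + 2 < segment.size()) {
            const int hi = hexValue(segment[i + 1]);
            const int lo = hexValue(segment[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += segment[i];
    }
    return out;
}

// Hypothetical rule: "///" announces one %-encoded path segment.
// Returns the raw bytes to hand to the file system.
std::string toFileSystemBytes(const std::string &path)
{
    const std::string marker = "///";
    std::string out;
    std::size_t pos = 0;
    while (pos < path.size()) {
        const std::size_t mark = path.find(marker, pos);
        if (mark == std::string::npos) {
            out.append(path, pos, std::string::npos);   // rest is a normal path
            break;
        }
        out.append(path, pos, mark - pos);              // normal part before the marker
        out += '/';                                     // marker collapses to one slash
        const std::size_t start = mark + marker.size();
        std::size_t end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();
        out += percentDecode(path.substr(start, end - start));
        pos = end;
    }
    return out;
}
```

Paths without the marker come back untouched, including ones that happen to contain '%', which is the non-intrusive property asked for above. This sketch does not address what a run of four or more slashes should mean.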

[snip]

> There is a possible solution for the edge case, implementable without 
> sacrificing the normal case, and which would cause little fall out in terms of 
> interoperability. But I really, really do not see how we can solve the generic 
> case of allowing the user to change the filesystem encoding at will. And by 
> user here, I mean both the developer using Qt by calling 
> QFile::setEncodingFunction as well as the end user toggling some configuration 
> switches in the system.

Hey, I did not mean to suggest that the filesystem encoding would change at will. I think filesystem encoding should be fixed and it is already so for some platforms, such as Windows and Mac. I find it odd that we'd use the locale for determining the encoding of the filesystem, essentially on Linux, but perhaps that is not a real problem in practice.

What is a real problem in practice, and the one that (in my mind) setEncodingFunction addresses, is not so much that of switching encodings, but that of allowing an escaping mechanism to be plugged in. Done this way, such escaping would be not only Qt-specific, but potentially application-specific. Still, it should enable a simple File Manager built on Qt to operate on all files it sees.

> To solve the edge case, we need to somehow store in a regular QString a byte 
> sequence that can be converted back to its original 8-bit form, regardless of 
> the locale encoding being used. If we say that changing the codecs themselves 
> in QTextCodec and QString is out of the question (that was the Qt 3 solution), 
> then the only place remaining is QFile::encodeName and QFile::decodeName. By 
> necessity, they must do more than just QString::{to,from}Local8Bit.

Yes. Those functions would handle escaping/unescaping, besides to/from(Local8Bit|Utf8).

> From there, we come to the conclusion that the QString representing such a file 
> name must contain special processing instructions (e.g., one or more special 
> characters). One form of special processing instruction is escaping each 
> character, like URLs do. The problem with the approach of escaping is what to 
> do when the escape character occurs in a file name. If that is a possibility, 
> the escape character needs to be escaped by itself (like "\\" for backslashes 
> in C or "%25" for percents in URLs). If we use this approach, then we will not 
> interoperate properly with non-Qt applications when this character happens.
> 
> The only sane solution, then, is to use a character that has a very small 
> chance of ever being used or, better yet, a zero chance (I don't think there's 
> any). If that happens, then this character will be close to "untypeable" on 
> the terminal. Not a big loss, I'd say.

We could use some magic sequence. Windows, for instance, uses the "\\?\" prefix to support longer paths. We could use '<' and '>', which are rare but valid, or we could give a specific meaning to sequences of 3 or more slashes.

I don't have a concrete solution at the moment.

> In fact, I'd recommend that, instead of escaping each bad character, we escape 
> each path component (the escaped sequence ends at the next slash). That is, 
> suppose I am a Greek user and I unpacked a bad .zip file on the "My Documents" 
> folder, which is called:
> 
> 	/home/foo/έγγραφα
> 
> If my file was called "Résumé.txt" in Latin1, the QString representing such a 
> file name would be:
> 	/home/foo/έγγραφα/<escape>Résumé.txt
> 
> If it was named "βιογραφικό σημείωμα.txt" in ISO-8859-7, the QString 
> representation would be:
> 	/home/foo/έγγραφα/<escape>âéïãñáöéêü óçìåßùìá.txt
> 
> That has the drawback of being hard to use when it comes to path manipulation. 
> Appending, prepending, extracting or inserting text could have unexpected 
> consequences.

I think any such scheme should support both absolute and relative paths and should allow a relative path to be combined with an absolute path with:

    absolute-path + '/' + relative-path

> An intermediate option would be to escape each sequence of non-locale 
> characters. Instead of the escaping ending at the slash, an unescape character 
> is necessary. For simplicity, let's say ⟪ shifts and ⟫ unshifts, the file could 
> be:
> 
> 	/home/foo/έγγραφα/⟪biocqavij|⟫ ⟪sgle_yla⟫.txt
> 
> 
> Pros:
> - implementable, with little fall-out for interoperability if we choose the 
> escape character well
> - probably survives a round-trip through the locale codec, so the escaping 
> isn't lost
> - since it survives the round-trip, it can be used across QProcess, on the 
> command-line, etc.
> - survives the user too, since it can be copy & pasted, edited, provided that 
> the escape characters remain

These are all good properties.
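
For reference, the shift/unshift variant quoted above can be sketched in a few lines of plain C++. The quoted example looks as if each shifted byte has its high bit cleared (ISO-8859-7 0xE2 becomes 'b', 0xFC becomes '|'); that reading is an inference from the example, not something stated in the proposal. The sketch also assumes the whole name is already known to be undecodable, so every run of bytes >= 0x80 gets shifted. The function names are invented.

```cpp
#include <cstddef>
#include <string>

// Escape characters taken from the quoted example, as UTF-8 bytes.
const std::string kShift = "\xE2\x9F\xAA";    // U+27EA
const std::string kUnshift = "\xE2\x9F\xAB";  // U+27EB

// Wraps every run of bytes >= 0x80 in shift/unshift markers and
// clears the high bit of the bytes inside the run (assumed rule).
std::string shiftEscape(const std::string &raw)
{
    std::string out;
    bool shifted = false;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(raw[i]);
        const bool high = c >= 0x80;
        if (high != shifted) {
            out += high ? kShift : kUnshift;
            shifted = high;
        }
        out += static_cast<char>(c & 0x7F);
    }
    if (shifted)
        out += kUnshift;   // close a run that reaches the end of the name
    return out;
}

// Inverse of shiftEscape: drops the markers and sets the high bit
// again on every byte that sits between them.
std::string shiftUnescape(const std::string &escaped)
{
    std::string out;
    bool shifted = false;
    std::size_t i = 0;
    while (i < escaped.size()) {
        if (escaped.compare(i, kShift.size(), kShift) == 0) {
            shifted = true;
            i += kShift.size();
        } else if (escaped.compare(i, kUnshift.size(), kUnshift) == 0) {
            shifted = false;
            i += kUnshift.size();
        } else {
            const unsigned char c = static_cast<unsigned char>(escaped[i]);
            out += static_cast<char>(shifted ? (c | 0x80) : c);
            ++i;
        }
    }
    return out;
}
```

Fed the ISO-8859-7 bytes of the Greek file name from the quote, shiftEscape produces the same shifted text as the quoted example, and shiftUnescape restores the original bytes. Deciding which byte runs are actually undecodable in the current locale is left out here and is the harder part.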

> Limitations:
> a) Qt-only, I don't expect anyone else to use such file names
> b) if encodeName() isn't used properly, it leads to a bad encoding of the file 
> name onto 8-bit. Applications dealing with the filesystem need to be extra 
> careful so as to not show two representations of the same file.
> c) for that matter, it's possible to produce an escaped form that matches a 
> regular file name
> d) double representations are often a source of security issues if not 
> handled carefully (cf. overlong sequences in UTF-8)

I don't see a) as such a big problem, since currently Qt can't even handle such file names. As for b) I think ideally we'd come up with something that makes the use of encode/decodeName invisible and doesn't require users to register their own encoding/decoding functions. c) is what we want to minimize.

As for d), if we make it all transparent and handled in a seamless way in Qt the problem that remains is how those paths interoperate with other applications and user code. It really helps to minimize c).

On the other hand we already have Qt-only paths in resource files and QDir::searchPaths(). We could easily use a well-known prefix for the special paths: url-encoded:/usr/joao/R%E9sum%E9.txt, which only supports absolute paths, but would already enable all items in my wish list.
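
A minimal sketch of the encoding direction for such a prefix, again in plain C++ and not a proposal for the actual API. It assumes, for brevity, that any byte >= 0x80 marks a name as "funny"; a real implementation would first try the locale codec and only escape what fails to decode. The name toPortablePath is hypothetical.

```cpp
#include <string>

// Hypothetical well-known prefix, as in the paragraph above.
const std::string kPrefix = "url-encoded:";

// Returns `nativeBytes` unchanged when it is pure ASCII. Otherwise
// returns the prefix followed by the path with every non-ASCII byte,
// and '%' itself, written as %HH so the result can be reversed.
std::string toPortablePath(const std::string &nativeBytes)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string body;
    bool needsPrefix = false;
    for (char ch : nativeBytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c >= 0x80 || c == '%') {
            body += '%';
            body += hex[c >> 4];
            body += hex[c & 0x0F];
            if (c >= 0x80)
                needsPrefix = true;   // only non-ASCII bytes force the special form
        } else {
            body += ch;
        }
    }
    return needsPrefix ? kPrefix + body : nativeBytes;
}
```

Normal names never gain the prefix, so they need no special handling; only a name with undecodable bytes is turned into the special form, which is plain ASCII and therefore survives being stored in a QString, a configuration file or a command line.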


> As you can see, I didn't come up with this today. I've known these 
> alternatives for years. I don't think they're worth our time.

Cheers,


João



