[Development] Deprecating QFile::encodeName/decodeName

Tue Jun 5 16:48:58 CEST 2012

Thiago Macieira wrote:
> On Tuesday, 5 June 2012 13.20.36, João Abecasis wrote:
>> I would not go as far as saying that the concept is broken, but it
>> definitely isn't portable. One big issue is that nowadays we don't
>> use an 8-bit encoding on Windows and those functions assume QString
>> <-> QByteArray conversions.
> 
> I called it broken due to the ambiguity when it comes to interpreting
> data that may or may not contain file names, like command-line
> arguments. See the next paragraph of what I had written.
> 
> Codecs must never be used in a situation where there's an ambiguity on
> what the encoding is. That will lead to mojibake sooner or later.

That's the whole point: you shouldn't treat file names differently when
processing the command line or configuration files. Each should be
interpreted using its own rules: the command line according to the
locale, configuration according to established convention (say,
UTF-8).

On top of that you would use some sort of escaping to specify byte
sequences not possible with the encoding of the context. This escaping
mechanism does not need to be understood or interpreted while reading
the configuration file or processing the command line, only at the point
where you actually want to use that string as a file path.

[snip]

> My problem is when the information that this chunk of 8-bit data is a
> file path is lost, such as in the command-line. Take the following
> case:
> 
>     git commit -m Résumé Résumé.txt
> 
> Here we have the same Unicode codepoint (U+00E9) appearing in two
> different contexts: a text message and a file name. If the filesystem
> encoding is different from the locale encoding, then those two
> parameters will potentially use two different byte sequences to
> represent the same codepoint (and, of course, on the shell you would
> not see "é" in both cases).
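
To make the byte-level difference concrete, here is a minimal sketch in
plain C++ (no Qt) that encodes a single code point both ways. The
function names encodeUtf8 and encodeLatin1 are made up for this example,
and the UTF-8 encoder does no surrogate or range checking.

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8 (sketch only: no surrogate or range checks).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as Latin-1; anything above U+00FF becomes '?'.
std::string encodeLatin1(char32_t cp)
{
    return std::string(1, cp <= 0xFF ? static_cast<char>(cp) : '?');
}
```

For U+00E9 the first function yields the two bytes C3 A9 and the second
the single byte E9, which is why the two arguments in the example can
differ on the wire even though they look identical on screen.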

Another way to consider this is to add further constraints. Let's, for
the sake of argument, assume i) you can only type ASCII characters
(broken keyboard or something); ii) command line processing expects
UTF-8 (your locale); iii) file name on disk is actually in Latin1.

How do you execute the same command, then?

[ iv) assuming bash as shell ]

    QtGit commit -m $'R\xc3\xa9sum\xc3\xa9' 'R%E9sum%E9.txt'

Bash enables you to pass around byte sequences it doesn't understand.
That would help in getting UTF-8 into the application but would not help
pass in Latin1, as the application would (correctly) flag it as an
invalid UTF-8 sequence and replace 'é' with '?'. Instead, we need to
escape the file name in a way that will not be misinterpreted by bash or
command line processing, giving code that actually expects a path (say,
QFile, encodeName or QFileSystemEntry) a chance to interpret it.
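
As an illustration of why raw Latin1 bytes would not survive UTF-8
command line processing, here is a rough sketch of a lossy UTF-8 decoder
in plain C++ (no Qt). The names utf8SequenceLength and decodeUtf8Lossy
are invented for this example, and the validation is simplified: it does
not reject overlong forms or surrogates, which a real decoder should.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence starting at s[i], or 0 if it is malformed.
// Simplified: overlong encodings and surrogates are not rejected.
std::size_t utf8SequenceLength(const std::string &s, std::size_t i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (lead < 0x80) len = 1;
    else if ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    else return 0;                       // stray continuation or invalid lead byte
    if (i + len > s.size()) return 0;    // sequence is cut short
    for (std::size_t k = 1; k < len; ++k) {
        if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
            return 0;                    // expected a continuation byte
    }
    return len;
}

// Copy well-formed sequences through; replace each offending byte with '?'.
std::string decodeUtf8Lossy(const std::string &s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t len = utf8SequenceLength(s, i);
        if (len == 0) {
            out += '?';
            ++i;
        } else {
            out.append(s, i, len);
            i += len;
        }
    }
    return out;
}
```

Fed the UTF-8 bytes of the message, the sketch passes them through
unchanged; fed the Latin1 bytes of the file name, each E9 byte comes out
as '?', and the original name can no longer be recovered.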

For this to work, command line parsing need not understand more than
UTF-8, but encodeName would then be able to convert those %HH sequences
to the proper Latin1-encoded "Résumé" and retrieve the right file from
the file system.
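
A minimal sketch of that last step, in plain C++ rather than Qt: the
helper unescapeFileName below is hypothetical (it is not an existing Qt
function and not what encodeName does today) and assumes the escaping
is %HH, as in the command above. It turns the ASCII-only argument into
the raw bytes to hand to the file system.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: convert %HH escapes into raw bytes. It is meant to
// be called only at the point where the string is used as a file path.
std::string unescapeFileName(const std::string &escaped)
{
    std::string bytes;
    for (std::size_t i = 0; i < escaped.size(); ++i) {
        if (escaped[i] == '%' && i + 2 < escaped.size()) {
            const int hi = hexValue(escaped[i + 1]);
            const int lo = hexValue(escaped[i + 2]);
            if (hi >= 0 && lo >= 0) {
                bytes += static_cast<char>(hi * 16 + lo);
                i += 2;   // skip the two hex digits just consumed
                continue;
            }
        }
        bytes += escaped[i];  // ordinary byte, or a malformed escape kept as-is
    }
    return bytes;
}
```

With this, 'R%E9sum%E9.txt' becomes the bytes R E9 s u m E9 . t x t,
which is the Latin1 name on disk. One consequence of such a scheme is
that a literal '%' in a file name would itself have to be written as
%25.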

> If we had a proper command-line argument class, one could write code
> like:
> 
>     QString message = args.textAt(1);
>     QString filename = args.filenameAt(2);

No, the functionality you propose as args.filenameAt(2) was being offered
by encodeName in QFile/QFileInfo/QDir. You don't need special handling
at this point.

> That implies that this command-line parsing class needs to keep the
> raw representation and convert on demand, or it must do double-keeping
> of the data. What's more, it requires that the developer know that the
> item at a given position is a filename, not text, which is the basis
> of my problem: this information is lost quite easily.

No, it requires that those who type paths on the command line are able
to pass them through to the application, and know how to do so.

It means you will lose the benefit of auto-completion (not every shell
will understand Qt's escaping mechanism) for files with "funny" names.
It also means you'd be able to access those files, nonetheless.
Currently, you can't.

[snip]

> The Qt 3 solution for file name encodings was to modify the QString
> UTF-8 decoder and encoder so that QString could keep invalid
> sequences, which was a major violation of UTF-8 compliance and a
> security issue. What's more, it didn't completely solve the problem
> since those strings could be copied to non-Qt programs that would not
> understand them (e.g., from a file manager to xterm).

We're talking about the edge cases, right? 99% of uses will work with or
without the need for encode/decodeName.

Currently, we don't support the edge cases at all. If we were able to
support them, even without being able to interoperate with other
applications on those cases, it would already be an improvement.

> The way I see it, the only way that this could work is if two
> conditions were met:
> 
> 1) there is a cross-toolkit, lossless and encoding-independent
> representation of a filepath

This would be nice, agreed.

> 2) that all applications use this representation and no user or
> developer lose the representation.
> 
> I believe there is such a solution for #1: URLs. URLs can losslessly
> represent all possible paths and, in their base form (RFC 3986), they
> are also ASCII-compatible, which would make them immune to encoding
> changes. I can easily add methods to QUrl to return the "native" form
> of a local path: locale encoding on Unix, UTF-8 NFD on Mac, UTF-16 on
> Windows.
> 
> However, I don't see how to fix #2. Take the example above: git does
> not understand URLs. So if I had in my .desktop file:
> 
> [Desktop Entry]
> Encoding=UTF-8
> # the third argument to git commit is a URI reference for a filename 
> # called "Résumé.txt" encoded in Latin 1
> Exec=git commit -m Résumé R%E9sum%E9.txt
> 
> without modifications, git would not understand that representation
> and would not find the file on disk. Changing *every* *single*
> application under the sun to use a different representation for file
> paths than what they do today is not feasible.

What a lot of applications will do is treat byte-sequences (strings)
agnostically and not validate them as valid UTF-8 sequences, allowing
you to still access files with those "funny" names.

While I agree with a lot of what you are saying about what is reasonable
and feasible for encodings of file names (heck, everyone should be using
UTF-8 or be shot on the spot!), Qt as a toolkit should not lock you out
of files that, mistakenly or not, ended up with those "funny" names. One
consequence of using the locale to decide how to encode file names is
that it is all too easy to come across those files (e-mail, file
sharing, USB sticks and whatnot).

Again, as a general purpose toolkit Qt needs to allow you to read,
rename and delete those files. It should potentially allow you to store
their names and come back to them at a later time.

Maybe QUrl is part of the solution. The Mac seems to already show
%-encoding for "funny" names.
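
To show what such a %-encoded form could look like for storing a name
and coming back to it later, here is a small sketch in plain C++. The
helper escapeFileName is hypothetical and deliberately simpler than
QUrl or RFC 3986: it escapes only '%', spaces, control characters and
non-ASCII bytes, and leaves all other printable ASCII alone.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: produce an ASCII-only form of a native file name.
// Not a complete RFC 3986 encoder; reserved characters such as '/' or '?'
// are passed through unchanged.
std::string escapeFileName(const std::string &nativeBytes)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string ascii;
    for (unsigned char byte : nativeBytes) {
        const bool plain = byte > 0x20 && byte < 0x7F && byte != '%';
        if (plain) {
            ascii += static_cast<char>(byte);
        } else {
            ascii += '%';
            ascii += hex[byte >> 4];
            ascii += hex[byte & 0x0F];
        }
    }
    return ascii;
}
```

Because the output is pure ASCII, it reads the same under UTF-8, Latin1
or any other ASCII-compatible locale, and the original bytes can be
recovered exactly by reversing the escapes.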

Cheers,

João