[Development] Deprecating QFile::encodeName/decodeName

Thiago Macieira thiago.macieira at intel.com
Tue Jun 5 15:01:51 CEST 2012


On Tuesday, 5 June 2012 13.20.36, João Abecasis wrote:
> If they're no-op and we have no intention of reviving them, then this is
> the right time to completely remove them. Carrying around this cruft
> for no good reason doesn't help anyone.

We can remove them like we did QTextCodec::setCodecForCStrings. The number of 
people using those functions in the world must be countable on the fingers of 
one hand.

> > Rationale:
> > 
> > Those two functions have been present in Qt since at least Qt 2 (see
> > [1] and [2]). Their purpose was to convert the UTF-16-based QString
> > filenames to the local filesystem encodings on Unix systems. Back in
> > those days, some people had file names encoded with a different
> > encoding than the locale -- for example, UTF-8 for the file names and
> > Latin 1 or 15 for the data, or vice versa.
> > 
> > For that reason, Qt developers should use QFile::encodeName when
> > converting a QString for use in the POSIX C function calls, and
> > similarly use QFile::decodeName when converting data from the POSIX
> > calls to QString.
> > 
> > That concept is broken.
> 
> I would not go as far as saying that the concept is broken, but it
> definitely isn't portable. One big issue is that nowadays we don't use
> an 8-bit encoding on Windows and those functions assume QString <->
> QByteArray conversions.

I called it broken due to the ambiguity when it comes to interpreting data 
that may or may not contain file names, like command-line arguments. See the 
next paragraph of what I had written.

Codecs must never be used in a situation where there's an ambiguity on what 
the encoding is. That will lead to mojibake sooner or later.

> > The POSIX calls are not the only source of strings. There are more,
> > like for example the command-line and data found in files. When
> > parsing the command line, QCoreApplication applies
> > QString::fromLocal8Bit indiscriminately, since it doesn't know which
> > arguments refer to files and which ones don't. If you're reading from
> > a file, you're likely to do the same or to use QTextStream, which
> > amounts to the same problem.
> > 
> > Moreover, if you're going to print a file name to a file, what do you
> > use? When communicating with other programs, what encoding should be
> > used? How about reporting information on the terminal (stdout and
> > stderr)? Similar to QCoreApplication, when you set file names as part
> > of the arguments, QProcess indiscriminately applies toLocal8Bit,
> > since it doesn't know what is a file name and what isn't.
> 
> Here, you are mixing different things: paths on the filesystem, command
> line arguments and configuration settings. In the end it's all about
> strings and each must be interpreted in the context it lives in and not
> what it represents. I find it helpful to think of these things in terms
> of both multi-user systems and users that switch locales (I sometimes do
> this, myself).
> 
> 
> File system paths are meant to survive reboots, and changes of user and
> locale. As such, they should follow a (system-wide) predefined encoding.
> Typically, they're no more than a null-terminated sequence of bytes
> (2-bytes on Windows), where a reserved few are actually meaningful (say,
> '/', ':' and '\0').

Agreed.

> Configuration settings will typically belong to an application and may
> or may not be user specific. They're at least meant to survive
> application restarts. The responsibility for defining encoding for these
> lies with the application that owns them. System-wide settings used by
> multiple applications typically follow some standard or de facto
> convention for encodings.

Assuming the application knows that the entry in question is a file path, I 
don't see a problem. This worked in Qt 4 too.

> Arguments specified on the command line are short-lived. They belong to
> the current session and should be interpreted according to the current
> locale. Applications should assume the user typed those straight from
> the shell.

My problem is when the information that this chunk of 8-bit data is a file path 
is lost, such as on the command line. Take the following case:

	git commit -m Résumé Résumé.txt

Here we have the same Unicode codepoint (U+00E9) appearing in two different 
contexts: a text message and a file name. If the filesystem encoding is different 
from the locale encoding, then those two parameters will potentially use two 
different byte sequences to represent the same codepoint (and, of course, on 
the shell you would not see "é" in both cases).
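To make the byte-level difference concrete, here is a minimal sketch in plain 
C++ (no Qt) that encodes a single code point both ways. The helper names 
toUtf8 and toLatin1 are made up for this illustration, and the UTF-8 encoder 
only covers code points up to U+FFFF:

```cpp
#include <string>

// Encode one code point as UTF-8. Only code points up to U+FFFF are
// handled, which is all this illustration needs.
std::string toUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as Latin 1. Anything above U+00FF cannot be
// represented and becomes '?'.
std::string toLatin1(char32_t cp)
{
    return std::string(1, cp <= 0xFF ? static_cast<char>(cp) : '?');
}
```

For U+00E9, toUtf8 yields the two bytes C3 A9 while toLatin1 yields the single 
byte E9, so the two arguments in the git example would differ on the wire even 
though they display the same character.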

If we had a proper command-line argument class, one could write code like:

	QString message = args.textAt(1);
	QString filename = args.filenameAt(2);

That implies that this command-line parsing class needs to keep the raw 
representation and convert on demand, or it must keep two copies of the 
data. What's more, it requires the developer to know that the item at a 
given position is a filename, not text, which is the basis of my problem: this 
information is lost quite easily.
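A rough sketch of the "keep the raw bytes, convert on demand" variant, in 
plain C++ rather than Qt. Everything here is hypothetical: the RawArguments 
class, the Encoding enum and the decode helper do not exist in Qt, and the 
UTF-8 decoder is deliberately minimal (one- and two-byte sequences only, no 
validation):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Encoding { Latin1, Utf8 };

// Decode raw bytes into code points. The UTF-8 branch handles only 1- and
// 2-byte sequences and does not validate continuation bytes; any other
// byte becomes U+FFFD.
std::u32string decode(const std::string &bytes, Encoding enc)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (enc == Encoding::Latin1 || b < 0x80) {
            out += static_cast<char32_t>(b);
        } else if ((b & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            const unsigned char b2 = static_cast<unsigned char>(bytes[++i]);
            out += static_cast<char32_t>(((b & 0x1F) << 6) | (b2 & 0x3F));
        } else {
            out += U'\uFFFD';
        }
    }
    return out;
}

// Hypothetical argument holder: stores the bytes exactly as received and
// decodes only when the caller says what kind of data it expects.
class RawArguments
{
public:
    RawArguments(std::vector<std::string> raw, Encoding locale, Encoding filesystem)
        : m_raw(std::move(raw)), m_locale(locale), m_filesystem(filesystem) {}

    // Interpret argument i as text in the locale encoding.
    std::u32string textAt(std::size_t i) const { return decode(m_raw.at(i), m_locale); }

    // Interpret argument i as a file name in the filesystem encoding.
    std::u32string filenameAt(std::size_t i) const { return decode(m_raw.at(i), m_filesystem); }

private:
    std::vector<std::string> m_raw;
    Encoding m_locale;
    Encoding m_filesystem;
};
```

The sketch only works because the caller picks textAt or filenameAt per 
position. Nothing in the stored bytes says which one is right, so a caller who 
picks the wrong accessor silently gets the wrong string.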

> [ If you save the command line in a configuration (say .desktop) file,
> then your shell should read it according to the rules of such file and
> convert it to current locale before invoking execv.

How?

Imagine:
[Desktop Entry]
Encoding=UTF-8
Exec=git commit -m Résumé Résumé.txt

*How* will it convert that UTF-8 data to 8-bit? How will the desktop parser 
know that one of the arguments is text and the other is a filename? We can't 
very well write the file name's raw binary code there because of that 
"Encoding" line above: the .desktop file would not parse as UTF-8 and could not 
be edited in a regular text editor.

For that matter, let's take this to QProcess:
	arguments << "-m" << "Résumé" << "Résumé.txt";
	process.start("git", arguments);

How will QProcess know that arguments.at(1) should be encoded using 
QString::toLocal8Bit() and at(2) should go through QFile::encodeName()?


> And now for the problem encode/decodeName pretends to solve...
> 
> Whatever the encoding that's being used for command line arguments and
> configuration settings, applications should support an escaping or
> encoding mechanism to allow access to file system paths that can't be
> directly represented in their own encodings. Since we don't provide one
> ourselves we need to enable applications to do it.

We don't provide that because we use QString for file names. We tried to 
provide that (from Qt 3 to Qt 4.3 IIRC), but it caused compatibility issues 
elsewhere.

> > Therefore, I call this concept broken. There's only one possible
> > encoding for the file system and it's the locale's encoding.
> 
> No, the concept is not broken. The implementation we had failed to
> adequately support the cross-platform use case and, in a way, missed the
> point. It didn't support Unicode-Windows, I don't think it was being
> used on Mac and elsewhere there were spots that didn't consistently use
> the API (I hope we fixed most of those in the 4.8 series, though).

The implementation was flawed, sure. But I still think that having a filesystem 
encoding different from the locale encoding is a broken concept. You run into 
ambiguity issues far too easily for it to be workable.

> Longer term, I think part of the solution is to expose the internal
> QFileSystemEntry API and to primarily use that for passing paths around.
> It might be nice to offer a default escaping and encoding mechanism to
> reversibly express arbitrary byte sequences in plain (ASCII) text. That
> would remove the need for users to come up with their own solutions to a
> real problem.

I've given a QFileName / QFilePath / QFileSystemEntry class some thought more 
than once. In the end, I do not think it will solve the problem of the 
encoding at all. It might solve other problems, though.

The big issue is that people still take QStrings out of them, whereby they 
cause the information that this piece of string was a file path to be lost. 
Even if we modify QProcess's API to understand that its arguments are either 
QString or QFileSystemEntry, we will still have people using QStrings for 
filename arguments because they came from elsewhere up in their stacks.

The Qt 3 solution for file name encodings was to modify the QString UTF-8 
decoder and encoder so that QString could keep invalid sequences, which was a 
major violation of UTF-8 compliance and a security issue. What's more, it 
didn't completely solve the problem since those strings could be copied to 
non-Qt programs that would not understand them (e.g., from a file manager to 
xterm).

The way I see it, the only way that this could work is if two conditions were 
met:

1) there is a cross-toolkit, lossless and encoding-independent representation 
of a filepath

2) all applications use this representation, and no user or developer loses 
it.

I believe there is such a solution for #1: URLs. URLs can losslessly represent 
all possible paths and, in their base form (RFC 3986), they are also ASCII-
compatible, which would make them immune to encoding changes. I can easily add 
methods to QUrl to return the "native" form of a local path: locale encoding 
on Unix, UTF-8 NFD on Mac, UTF-16 on Windows.

However, I don't see how to fix #2. Take the example above: git does not 
understand URLs. So if I had in my .desktop file:

[Desktop Entry]
Encoding=UTF-8
# the third argument to git commit is a URI reference for a filename 
# called "Résumé.txt" encoded in Latin 1
Exec=git commit -m Résumé R%E9sum%E9.txt

without modifications, git would not understand that representation and would 
not find the file on disk. Changing *every* *single* application under the sun 
to use a different representation for file paths than what they do today is not 
feasible.
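For reference, producing and reversing a form like "R%E9sum%E9.txt" from the 
raw bytes is simple. The following is a minimal sketch in plain C++, not the 
QUrl implementation; percentEncode and percentDecode are names invented here, 
and leaving '/' unescaped as the path separator is my assumption:

```cpp
#include <cstddef>
#include <string>

// Percent-encode every byte that is not an RFC 3986 unreserved character.
// '/' is also left alone so that path separators stay readable.
std::string percentEncode(const std::string &rawPath)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : rawPath) {
        const bool keep = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                       || (b >= '0' && b <= '9')
                       || b == '-' || b == '.' || b == '_' || b == '~' || b == '/';
        if (keep) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
        }
    }
    return out;
}

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Reverse percentEncode. A '%' that is not followed by two hex digits is
// copied through unchanged.
std::string percentDecode(const std::string &encoded)
{
    std::string out;
    for (std::size_t i = 0; i < encoded.size(); ++i) {
        if (encoded[i] == '%' && i + 2 < encoded.size()
            && hexValue(encoded[i + 1]) >= 0 && hexValue(encoded[i + 2]) >= 0) {
            out += static_cast<char>(hexValue(encoded[i + 1]) * 16 + hexValue(encoded[i + 2]));
            i += 2;
        } else {
            out += encoded[i];
        }
    }
    return out;
}
```

The round trip is lossless for any byte sequence, whatever encoding those bytes 
were in, which is what condition #1 asks for. It does nothing for condition #2: 
the receiving program still has to know that it must decode.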

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden