[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira
thiago.macieira at intel.com
Sun Nov 17 01:55:32 CET 2019
Hi
Sorry, it looks like this thread is not progressing in a calm and reasoned
manner, the way it was meant to be. And I'm very much to blame. So I apologise
for the strong language and passionate opinions. I'm deleting most of what I
had written as a reply so we can start over.
Let's start with your questions:
On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
> You have not yet answered
>
> - why this decision was made
You know, I don't know. To be frank, I don't know that a decision *was* made.
It all started with a change (see OP) about removing QTextCodec from the API
and from QtCore. It seemed reasonable enough but it turned up quite a few
kinks that hadn't been predicted. One of them, which may still be a
showstopper, is QXmlStreamReader's inability to handle XML data encoded in
anything except UTF-8, though a thorough search of all XML files in my system
turned up exactly zero such files.
I don't know why QTextCodec is being removed. I don't remember any decisions
in prior QtCS or this mailing list about removing it. We definitely discussed
removing the CJK codecs and their big tables and that can still be done, with
no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have
discussed removing it, but I don't remember a firm decision. And even if it is
firm, after looking at the consequences of doing so, we may want to reverse
our decision.
Related to that is the discussion of whether UTF-8 is the only acceptable
locale on Unix systems. If we don't have QTextCodec, then we have to have
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8.
But even if we do have QTextCodec, that's still a reasonable question: should
we assume it is UTF-8? And should we enforce it? Those were the questions in my
OP.
> - who did it
Considering I don't know a decision *was* made, I don't think we can say who
made it.
> - what the actual problem to solve was
Three things being tackled, all related:
1) QTextCodec in the API
I think we cannot do without it, it'll have to stay in one way or another. So
the question reduces to whether it should stay in QtCore or be moved to
another library. Given the QXmlStreamReader problem above, it's probably best
to keep it in QtCore, actually.
QTextCodec has some API limitations but they can be fixed. It's not necessary
for us to remove it: it's not *that* broken.
2) QtCore size
As I said above, removing the legacy codecs we have code for is not a problem.
They are already disabled in Qt builds where ICU is present, so we'd
additionally remove them from all other builds. Where ICU is present, there's
no loss of functionality for user applications, since ICU provides far more
codecs than we do. For those without ICU, it stands to reason that the user
chose size so they are aware of the limitations. Plus, one can always
instantiate their own QTextCodec and add to the list (at least, with today's
implementation).
If QTextCodec is not in QtCore, then most likely you can't affect how QtCore
and almost all other Qt classes decode 8-bit data into QString, including
QTextStream.
and 3) misconfigured locale systems and filename handling
This is probably the biggest problem. As it is right now, when the locale
isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode
any file names with the 8th bit set. Those file names are considered
filesystem corruption. And yet they are quite commonly created by the user
outside of English-speaking jurisdictions.
Your example of setting LC_ALL (or another environment variable) to force the
locale to print something that either can be parsed or shared is one such
problematic scenario. On one hand, you may need it to get some older tools to
parse output; on the other, it makes Qt applications unable to even see some
files exist.
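To make the "8th bit" problem concrete, here is a minimal sketch in plain
C++. It is not Qt's actual decoding code; isPureAscii is a hypothetical helper
that only shows which file names a strict 7-bit "C" locale decoder would have
to reject.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not a Qt API): true if every byte is 7-bit ASCII,
// i.e. the only kind of name a strict "C"-locale decoder can represent.
bool isPureAscii(const std::string &bytes)
{
    return std::all_of(bytes.begin(), bytes.end(),
                       [](unsigned char c) { return c < 0x80; });
}
```

A name such as "track01.mp3" passes, while the UTF-8 bytes of "café.txt"
(which contain 0xC3 0xA9) do not, even though the file is perfectly valid on
disk.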
> - why LC_*ALL* comes into play
Because it's the override. If we decide to override and LC_ALL is set, then we
have no choice but to override it. If it is unset, then we can leave it unset
too, but may need to override LC_CTYPE.
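As a rough sketch of that precedence (plain C++, no Qt): POSIX resolves the
character-type category from LC_ALL first, then LC_CTYPE, then LANG, treating
empty values as unset. The names Env, decidingVariable and variableToOverride
are made up for illustration, and the second function only mirrors the rule
described above; it is not existing Qt behaviour.

```cpp
#include <initializer_list>
#include <map>
#include <string>

using Env = std::map<std::string, std::string>;

// Name of the variable that currently decides LC_CTYPE, or "" if none is set.
// POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty counts as unset.
std::string decidingVariable(const Env &env)
{
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        auto it = env.find(name);
        if (it != env.end() && !it->second.empty())
            return name;
    }
    return std::string();
}

// Hypothetical policy from this mail: if LC_ALL is set we must override it,
// otherwise LC_ALL can stay unset and LC_CTYPE is the one to write.
std::string variableToOverride(const Env &env)
{
    auto it = env.find("LC_ALL");
    if (it != env.end() && !it->second.empty())
        return "LC_ALL";
    return "LC_CTYPE";
}
```

With LC_ALL=C and LANG=ja_JP.UTF-8 in the environment, both functions name
LC_ALL; with only LANG=ja_JP set, LANG decides today but LC_CTYPE is what an
override would write.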
> I get the impression that this thread was not started as an RFC for an
> open-ended discussion, but as a staged attempt to provide a figleaf for
> a pre-determined decision.
That was not the intention. That's why I am re-starting it so we can come back
to a reasoned approach.
Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change how we handle legacy (non-UTF-8) locales?
For #2, the sub-questions of the OP apply:
a) What should Qt 6 assume the locale to be, if no locale is set?
b) In case a non-UTF-8 locale is set, what should we do?
c) Should we propagate our decision to child processes?
My preferences were:
a) C.UTF-8
b) override it to force UTF-8 on the same locale
c) yes
The reason for my preference in propagating to child processes is so that we
have a consistent protocol between parent and child. Moreover, the mechanism
for propagating to the child process is the same one that prevents other code
in the same library from accidentally undoing our override (due to 2.b):
qputenv.
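A minimal sketch of why one mechanism covers both cases, using POSIX setenv()
as a stand-in for qputenv(). overrideCtype is a hypothetical helper, not
something that exists in Qt. Once the variable is exported, child processes
inherit it, and any later setlocale(LC_ALL, "") in the same process re-reads
the same value instead of the original one.

```cpp
#include <clocale>
#include <stdlib.h>   // setenv() is POSIX, not ISO C++
#include <string>

// Hypothetical helper standing in for qputenv() followed by setlocale().
// Returns true if the C library accepted the resulting locale; that depends
// on which locales are installed on the system.
bool overrideCtype(const std::string &variable, const std::string &value)
{
    // Exported variables are inherited by children and are what any later
    // setlocale(..., "") call in this process will consult.
    if (::setenv(variable.c_str(), value.c_str(), 1) != 0)
        return false;
    return std::setlocale(LC_CTYPE, "") != nullptr;
}
```

Note that if LC_ALL is set in the environment it still wins over LC_CTYPE,
which is why the variable to write has to be chosen with care.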
I don't think that assuming the locale to be UTF-8 without using setlocale()
to inform such to the C library is acceptable. It would mean strerror() would
produce mojibake for us -- and since QString::fromLocal8Bit doesn't take
kindly to mojibake, in most languages qt_error_string() would return empty for
any and all error conditions. Just try ENOENT in ja_JP. Going further, I think
that if we change "ja_JP" to "ja_JP.UTF-8", we should set it in the
environment so that the child processes will produce "そのようなファイルやディレクトリはありません"
for ENOENT instead of undecodable mojibake.
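For illustration, "force UTF-8 on the same locale" could look like the sketch
below. withUtf8Codeset is a hypothetical name; the code assumes the usual
language[_territory][.codeset][@modifier] naming and does not check that the
resulting locale (including "C.UTF-8") is actually installed.

```cpp
#include <string>

// Hypothetical helper: keep language, territory and modifier, but replace
// (or add) the codeset so that it is always UTF-8.
std::string withUtf8Codeset(const std::string &locale)
{
    if (locale.empty() || locale == "C" || locale == "POSIX")
        return "C.UTF-8";

    // Split off an optional "@modifier" suffix.
    const std::string::size_type at = locale.find('@');
    const std::string modifier =
        at == std::string::npos ? std::string() : locale.substr(at);
    std::string base = locale.substr(0, at);  // npos keeps the whole string

    // Drop an existing ".codeset", if any.
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);

    return base + ".UTF-8" + modifier;
}
```

So "ja_JP" and "ja_JP.eucJP" both become "ja_JP.UTF-8", and an unset or "C"
locale becomes "C.UTF-8".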
Turns out, there's one locale whose non-UTF-8 default we can be sure is
decodable under UTF-8, and that's the "C" locale. So we don't *have* to
qputenv "C.UTF-8" if the locale is explicitly "C" (as opposed to being unset).
But I think we should. My arguments are that UTF-8 locales are the default in
all desktop Linux distributions, all BSDs and on macOS and have been for 15
years. Most embedded systems from the last 5 years at least also have it as
the default, especially those with graphical HMIs and most especially those
using Qt for that. Any applications that had problems with UTF-8 must have
been fixed for a long time and those that didn't are almost certainly launched
from wrappers that set a suitable environment for them, either via
QProcessEnvironment, execle, a shell script, or some other mechanism.
Moreover, setting the locale to non-UTF-8 on a Qt 4 or 5 application on a
system with UTF-8-encoded file names is just *wrong* and asking for trouble,
for the filesystem reasons stated above. Just as an example, think of an
embedded system with a multimedia player that reads a FAT32-formatted USB
stick: it wouldn't go very far if it couldn't even see the music files with
non-ASCII characters in them. So I feel confident when I say applications
being ported to Qt 6 are not subject to that problem. Therefore, our
resetting of the environment inside the Qt 6 application is not going to
affect the child processes.
But if we disagree and think we shouldn't qputenv, I still think we should
assume by default the locale *is* UTF-8, even if the environment tells us it
isn't (an explicit LANG=ja_JP for example, but much more commonly an LC_ALL=C
override). The changing of the encoding is usually an undesired side-effect,
not an intentional choice. That is to say, LANG=ja_JP was actually meant to be
LANG=ja_JP.UTF-8 and LC_ALL=C could have been for the parsing reasons you
brought up. If we don't do the qputenv(), we'll still setlocale() in
QCoreApplication so qt_error_string() produces output and we'll live with the
danger that some code undoes our choice. My search through Linux library code
found no instance of a permanent setlocale() call with a non-null second
parameter (Qt is actually the only exception).
I hope this clarifies things and we're back at a rational discussion.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products