[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Sun Nov 17 01:55:32 CET 2019

Hi

Sorry, it looks like this thread is not progressing in a calm and reasoned 
manner, the way it was meant to be. And I'm very much to blame. So I apologise 
for the strong language and passionate opinions. I'm deleting most of what I 
had written as a reply so we can start over.

Let's start with your questions:

On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
> You have not yet answered
> 
>   - why this decision was made

You know, I don't know. To be frank, I don't know that a decision *was* made. 
It all started with a change (see OP) about removing QTextCodec from the API 
and from QtCore. It seemed reasonable enough but it turned up quite a few 
kinks that hadn't been predicted. One of them, which may still be a 
showstopper, is QXmlStreamReader's inability to handle XML data encoded in 
anything except UTF-8, though a thorough search of all XML files in my system 
turned up exactly zero such files.

I don't know why QTextCodec is being removed. I don't remember any decisions 
in prior QtCS or this mailing list about removing it. We definitely discussed 
removing the CJK codecs and their big tables and that can still be done, with 
no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have 
discussed removing it, but I don't remember a firm decision. And even if it is 
firm, after looking at the consequences of doing so, we may want to reverse 
our decision.

Related to that is the discussion of whether UTF-8 is the only acceptable 
locale on Unix systems. If we don't have QTextCodec, then we have to have 
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8. 
But even if we do have QTextCodec, that's still a reasonable question: should 
we assume it is UTF-8? And should we enforce it? Those were the questions in 
my OP.

>   - who did it

Considering I don't know a decision *was* made, I don't think we can say who 
made it.

>   - what the actual problem to solve was

Three things being tackled, all related:

1) QTextCodec in the API
I think we cannot do without it, it'll have to stay in one way or another. So 
the question reduces to whether it should stay in QtCore or be moved to 
another library. Given the QXmlStreamReader problem above, it's probably best 
to keep it in QtCore, actually.

QTextCodec has some API limitations but they can be fixed. It's not necessary 
for us to remove it: it's not *that* broken.

2) QtCore size
As I said above, removing the legacy codecs we have code for is not a problem. 
They are already disabled in Qt builds where ICU is present, so we'd 
additionally remove them from all other builds. Where ICU is present, there's 
no loss of functionality for user applications, since ICU provides far more 
codecs than we do. For those without ICU, it stands to reason that the user 
chose size so they are aware of the limitations. Plus, one can always 
instantiate their own QTextCodec and add to the list (at least, with today's 
implementation).

If QTextCodec is not in QtCore, then most likely you can't affect how QtCore 
and almost all other Qt classes decode 8-bit data into QString, including 
QTextStream.

3) Misconfigured locale systems and filename handling
This is probably the biggest problem. As it is right now, when the locale 
isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode 
any file names with the 8th bit set. Those file names are considered 
filesystem corruption. And yet they are quite commonly created by the user 
outside of English-speaking jurisdictions.

Your example of setting LC_ALL (or another environment variable) to force the 
locale to print something that either can be parsed or shared is one such 
problematic scenario. On one hand, you may need it to get some older tools to 
parse output; on the other, it makes Qt applications unable to even see some 
files exist.

>   - why LC_*ALL* comes into play

Because it's the override. If we decide to override and LC_ALL is set, then we 
have no choice but to override it. If it is unset, then we can leave it unset 
too, but may need to override LC_CTYPE.
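
To make the precedence concrete, here is a minimal sketch of how the character 
encoding category is resolved from the environment under POSIX rules: LC_ALL 
wins, then LC_CTYPE, then LANG. It uses plain getenv() rather than anything in 
Qt, and the function name ctypeLocaleFromEnvironment is hypothetical, chosen 
only for illustration.

```cpp
#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper (not a Qt API): returns the locale name that governs
// LC_CTYPE according to the environment, or an empty string when nothing is
// set, in which case the C library falls back to the "C" locale.
std::string ctypeLocaleFromEnvironment()
{
    // POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for (const char *variable : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char *value = std::getenv(variable);
        if (value && *value)   // an empty value counts as unset
            return value;
    }
    return {};
}
```

With LANG=ja_JP.UTF-8 and LC_ALL=C both set, this returns "C", which is why an 
LC_ALL override cannot be corrected by touching LC_CTYPE alone.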

> I get the impression that this thread was not started as an RFC for an
> open-ended discussion, but as a staged attempt to provide a figleaf for
> a pre-determined decision.

That was not the intention. That's why I am re-starting it so we can come back 
to a reasoned approach.

Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change we handle legacy (non-UTF8) locales?

For #2, the sub-questions of the OP apply:
 a) What should Qt 6 assume the locale to be, if no locale is set?
 b) In case a non-UTF-8 locale is set, what should we do?
 c) Should we propagate our decision to child processes?

My preferences were:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes

The reason for my preference in propagating to child processes is so that we 
have a consistent protocol between parent and child. Moreover, the mechanism 
for propagating to the child process is the same one that prevents other code 
in the same library from accidentally undoing our override (due to 2.b): 
qputenv.
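
A rough sketch of that mechanism, under stated assumptions: it uses POSIX 
setenv() as a stand-in for qputenv, the helper name exportCtypeLocale is 
hypothetical, and it is not a description of what QCoreApplication does. It 
writes the chosen locale to LC_ALL when that variable is already set and to 
LC_CTYPE otherwise, then asks the C library to re-read the environment.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper (not a Qt API): export utf8Locale so that later
// setlocale(..., "") calls in this process and any child processes started
// afterwards see the same value.
bool exportCtypeLocale(const std::string &utf8Locale)
{
    // LC_ALL overrides everything, so if it is set it must be the one changed;
    // otherwise changing LC_CTYPE alone is enough.
    const char *all = std::getenv("LC_ALL");
    const char *variable = (all && *all) ? "LC_ALL" : "LC_CTYPE";

    if (::setenv(variable, utf8Locale.c_str(), 1) != 0)
        return false;

    // Re-read the environment. setlocale() returns a null pointer when the
    // requested locale is not installed; the variable stays exported anyway.
    return std::setlocale(LC_CTYPE, "") != nullptr;
}
```

Because the change lives in the environment, a library that later calls 
setlocale(LC_ALL, "") ends up with the same UTF-8 locale instead of reverting 
to the original one.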

I don't think that assuming the locale to be UTF-8 without using setlocale() 
to inform such to the C library is acceptable. It would mean strerror() would 
produce mojibake for us -- and since QString::fromLocal8Bit doesn't take 
kindly to mojibake, in most languages qt_error_string() would return empty for 
any and all error conditions. Just try ENOENT in ja_JP. Going further, I think 
that if we change "ja_JP" to "ja_JP.UTF-8", we should set it in the 
environment so that the child processes will produce "そのようなファイルやディレクトリはありません" 
for ENOENT instead of undecodable mojibake.

Turns out, there's one locale whose non-UTF-8 default we can be sure is 
decodable under UTF-8, and that's the "C" locale. So we don't *have* to 
qputenv "C.UTF-8" if the locale is explicitly "C" (as opposed to being unset).
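
The renaming in 2.b ("force UTF-8 on the same locale") can be sketched as a 
pure string transformation. This assumes locale names follow the usual 
language[_territory][.codeset][@modifier] form; the function name 
withUtf8Codeset is hypothetical, and mapping an unset (empty) name and "POSIX" 
to "C.UTF-8" is an assumption of this sketch rather than settled behaviour.

```cpp
#include <string>

// Hypothetical helper (not a Qt API): return the same locale with its
// codeset replaced by UTF-8, keeping any @modifier.
std::string withUtf8Codeset(const std::string &name)
{
    // Assumption: unset, "C" and "POSIX" all become "C.UTF-8".
    if (name.empty() || name == "C" || name == "POSIX")
        return "C.UTF-8";

    std::string base = name;
    std::string modifier;

    // Split off a trailing "@modifier", e.g. "@euro".
    const auto at = base.find('@');
    if (at != std::string::npos) {
        modifier = base.substr(at);
        base.erase(at);
    }

    // Drop any existing ".codeset", e.g. ".eucJP".
    const auto dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);

    return base + ".UTF-8" + modifier;
}
```

For example, "ja_JP" and "ja_JP.eucJP" both become "ja_JP.UTF-8", and 
"de_DE@euro" becomes "de_DE.UTF-8@euro". Whether the resulting locale is 
actually installed on the system is a separate question that only setlocale() 
can answer.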

But I think we should. My arguments are that UTF-8 locales are the default in 
all desktop Linux distributions, all BSDs and on macOS and have been for 15 
years. Most embedded systems from the last 5 years at least also have it as 
the default, especially those with graphical HMIs and most especially those 
using Qt for that. Any applications that had problems with UTF-8 must have 
been fixed for a long time and those that didn't are almost certainly launched 
from wrappers that set a suitable environment for them, either via 
QProcessEnvironment, execle, a shell script, or some other mechanism. 

Moreover, setting the locale to non-UTF-8 on a Qt 4 or 5 application on a 
system with UTF-8-encoded file names is just *wrong* and asking for trouble, 
for the filesystem reasons stated above. Just as an example, think of an 
embedded system with a multimedia player that reads a FAT32-formatted USB 
stick: it wouldn't go very far if it couldn't even see the music files with 
non-ASCII characters in them. So I feel confident when I say applications 
being ported to Qt 6 are not subject to that problem. Therefore, our 
resetting of the environment inside the Qt 6 application is not going to 
affect the child processes.

But if we disagree and think we shouldn't qputenv, I still think we should 
assume by default the locale *is* UTF-8, even if the environment tells us it 
isn't (an explicit LANG=ja_JP for example, but much more commonly an LC_ALL=C 
override). The changing of the encoding is usually an undesired side-effect, 
not an intentional choice. That is to say, LANG=ja_JP was actually meant to be 
LANG=ja_JP.UTF-8 and LC_ALL=C could have been for the parsing reasons you 
brought up. If we don't do the qputenv(), we'll still setlocale() in 
QCoreApplication so qt_error_string() produces output and we'll live with the 
danger that some code undoes our choice. My search through Linux library code 
found no instance of a permanent setlocale() call with a non-null second 
parameter (Qt is actually the only exception).

I hope this clarifies things and we're back at a rational discussion.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products