[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Mon Nov 18 00:12:19 CET 2019

On 17/11/19 01:55, Thiago Macieira wrote:
> Hi
> 
> Sorry, it looks like this thread is not progressing in a calm and reasoned
> manner, the way it was meant to be. And I'm very much to blame. So I apologise
> for the strong language and passionate opinions. I'm deleting most of what I
> had written as a reply so we can start over.
> 
> Let's start with your questions:
> 
> On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
>> You have not yet answered
>>
>>    - why this decision was made
> 
> You know, I don't know. To be frank, I don't know that a decision *was* made.
> It all started with a change (see OP) about removing QTextCodec from the API
> and from QtCore. It seemed reasonable enough but it turned up quite a few
> kinks that hadn't been predicted. One of them, which may still be a
> showstopper, is QXmlStreamReader's inability to handle XML data encoded in
> anything except UTF-8, though a thorough search of all XML files in my system
> turned up exactly zero such files.
> 
> I don't know why QTextCodec is being removed. I don't remember any decisions
> in prior QtCS or this mailing list about removing it. We definitely discussed
> removing the CJK codecs and their big tables and that can still be done, with
> no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have
> discussed removing it, but I don't remember a firm decision. And even if it is
> firm, after looking at the consequences of doing so, we may want to reverse
> our decision.

I don't know either. Is it to make QtCore smaller? Wasn't the feature 
system ("Qt Lite") supposed to address that? Or is it to make it less of 
a "kitchen sink", and split it into smaller libraries? Could that mean 
having QTextCodec in its own library, and QXmlStreamReader in another 
(that depends on the former)?

> Related to that is the discussion of whether UTF-8 is the only acceptable
> locale on Unix systems. If we don't have QTextCodec, then we have to have
> something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8.
> But even if we do have QTextCodec, that's still a reasonable question: should
> we assume it is UTF-8? And should we enforce it? Those were the questions in my
> OP.

Should fromLocal8Bit be following the locale environment instead 
(LC_CTYPE, LC_MESSAGES or similar)?
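To make "following the locale environment" concrete, here is a minimal sketch 
of how the name governing LC_CTYPE is usually resolved on POSIX systems 
(LC_ALL first, then LC_CTYPE, then LANG) and how the codeset is read off that 
name. The helpers effectiveCtypeLocale() and codesetOf() are hypothetical 
names invented for illustration; they are not Qt API, and the "C" fallback is 
an assumption, since POSIX leaves that default implementation-defined.

```cpp
#include <string>

// Treat a null or empty variable as "not set", as POSIX does.
static bool isSet(const char *value)
{
    return value != nullptr && value[0] != '\0';
}

// Hypothetical helper: which locale name governs LC_CTYPE?
// Precedence follows POSIX: LC_ALL, then LC_CTYPE, then LANG.
std::string effectiveCtypeLocale(const char *lcAll, const char *lcCtype, const char *lang)
{
    if (isSet(lcAll))
        return lcAll;
    if (isSet(lcCtype))
        return lcCtype;
    if (isSet(lang))
        return lang;
    return "C"; // assumption: the real default is implementation-defined
}

// Hypothetical helper: extract the codeset from "language_TERRITORY.codeset@modifier".
// Returns an empty string when the name carries no explicit codeset ("ja_JP", "C").
std::string codesetOf(const std::string &localeName)
{
    const std::string::size_type dot = localeName.find('.');
    if (dot == std::string::npos)
        return std::string();
    const std::string::size_type at = localeName.find('@', dot);
    const std::string::size_type len =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return localeName.substr(dot + 1, len);
}
```

With LC_ALL=C and LANG=ja_JP.UTF-8 this resolves to "C" with no explicit 
codeset, which is exactly the LC_ALL=C override case discussed further down.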

> 2) QtCore size
> As I said above, removing the legacy codecs we have code for is not a problem.
> They are already disabled in Qt builds where ICU is present, so we'd
> additionally remove them from all other builds. Where ICU is present, there's
> no loss of functionality for user applications, since ICU provides far more
> codecs than we do. For those without ICU, it stands to reason that the user
> chose size so they are aware of the limitations. Plus, one can always
> instantiate their own QTextCodec and add to the list (at least, with today's
> implementation).
> 
> If QTextCodec is not in QtCore, then most likely you can't affect how QtCore
> and almost all other Qt classes decode 8-bit data into QString, including
> QTextStream.

See above -- it also means QTextStream goes in some I/O lib that 
contains or depends on the codecs lib.

> and 3) misconfigured locale systems and filename handling
> This is probably the biggest problem. As it is right now, when the locale
> isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode
> any file names with the 8th bit set. Those file names are considered
> filesystem corruption. And yet they are quite commonly created by the user
> outside of English-speaking jurisdictions.

Why do we bother about "saving the world"? A misconfigured system is the 
user's mistake. They should be in charge of fixing it in order to 
address the problem.
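For reference, the failure being described can be sketched in a few lines: 
under a plain C locale only 7-bit ASCII is decodable, so any file name with a 
byte at or above 0x80 falls outside it. isSevenBitClean() is a hypothetical 
helper written for this illustration, not something Qt provides.

```cpp
#include <string>

// Hypothetical helper: true if every byte fits in 7-bit ASCII, i.e. the
// name would be decodable even under a plain "C" locale.
bool isSevenBitClean(const std::string &bytes)
{
    for (unsigned char c : bytes) {
        if (c & 0x80) // 8th bit set: not representable in ASCII
            return false;
    }
    return true;
}
```

A name such as "track01.mp3" passes, while the UTF-8 bytes of "café.mp3" 
(written "caf\xC3\xA9.mp3" in C++ source) do not.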

> 
>> I get the impression that this thread was not started as an RFC for an
>> open-ended discussion, but as a staged attempt to provide a figleaf for
>> a pre-determined decision.
> 
> That was not the intention. That's why I am re-starting it so we can come back
> to a reasoned approach.
> 
> Anyway, the two independent (but related) decisions we need to make are:
> 1) do we keep QTextCodec in QtCore?
> 2) do we want to change how we handle legacy (non-UTF-8) locales?
> 
> For #2, the sub-questions of the OP apply:
>   a) What should Qt 6 assume the locale to be, if no locale is set?
>   b) In case a non-UTF-8 locale is set, what should we do?
>   c) Should we propagate our decision to child processes?
> 
> My preferences were:
>   a) C.UTF-8
>   b) override it to force UTF-8 on the same locale
>   c) yes

How about

a) either C or C.UTF-8, but warn the user; in fact I'd up the ante and 
say: just assert/crash.

b) keep the choice. Silently changing it sounds like a bad idea; we 
should never override the user choices silently.

c) no. We shouldn't "fix" subprocesses. They have the right to make 
their own independent decisions.
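A rough sketch of the "warn, don't override" behaviour from (a) and (b), 
assuming a POSIX system: adopt whatever the environment asks for via 
setlocale(LC_ALL, ""), query the resulting codeset with nl_langinfo(CODESET), 
and complain on stderr without touching the environment. The function names 
isUtf8Codeset() and warnIfLocaleIsNotUtf8() are made up for this example, as 
are the message texts; nothing here is existing Qt behaviour.

```cpp
#include <cctype>
#include <clocale>
#include <cstdio>
#include <string>
#include <langinfo.h> // POSIX: nl_langinfo(), CODESET

// Hypothetical helper: accept "UTF-8", "utf8", "UTF8", ... as the same codeset.
bool isUtf8Codeset(const std::string &codeset)
{
    std::string normalized;
    for (unsigned char c : codeset) {
        if (c != '-')
            normalized += static_cast<char>(std::tolower(c));
    }
    return normalized == "utf8";
}

// Hypothetical startup check: honour the user's environment, then warn
// (without changing anything) if the resulting codeset is not UTF-8.
// Returns true when the active locale uses UTF-8.
bool warnIfLocaleIsNotUtf8()
{
    if (std::setlocale(LC_ALL, "") == nullptr) {
        std::fprintf(stderr,
                     "warning: the locale requested by the environment is not available\n");
        return false;
    }
    const char *codeset = nl_langinfo(CODESET);
    if (isUtf8Codeset(codeset))
        return true;
    std::fprintf(stderr,
                 "warning: locale codeset is \"%s\", not UTF-8; "
                 "non-ASCII file names may not be decodable\n",
                 codeset);
    return false;
}
```

Turning the warning into an assert or abort, as suggested in (a), would only 
change what happens on the false branch; the user's setting is never 
rewritten, and child processes inherit the environment untouched.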

> But I think we should. My arguments are that UTF-8 locales are the default in
> all desktop Linux distributions, all BSDs and on macOS and have been for 15
> years. Most embedded systems from the last 5 years at least also have it as
> the default, especially those with graphical HMIs and most especially those
> using Qt for that. Any applications that had problems with UTF-8 must have
> been fixed for a long time and those that didn't are almost certainly launched
> from wrappers that set a suitable environment for them, either via
> QProcessEnvironment, execle, a shell script, or some other mechanism.

Or, on the other hand: what is the chance that a system comes without a 
locale set? Which is the more likely conclusion: that it's an accident or 
a deliberate setting? If it's an accident, why not be *very* verbose 
about it?

> Moreover, setting the locale to non-UTF-8 on a Qt 4 or 5 application on a
> system with UTF-8-encoded file names is just *wrong* and asking for trouble,
> for the filesystem reasons stated above. Just as an example, think of an
> embedded system with a multimedia player that reads a FAT32-formatted USB
> stick: it wouldn't go very far if it couldn't even see the music files with
> non-ASCII characters in them. So I feel confident when I say applications
> targeting a port to Qt 6 are not subject to that problem. Therefore, our
> resetting of the environment inside the Qt 6 application is not going to
> affect the child processes.
> 
> But if we disagree and think we shouldn't qputenv, I still think we should
> assume by default the locale *is* UTF-8, even if the environment tells us it
> isn't (an explicit LANG=ja_JP for example, but much more commonly an LC_ALL=C
> override). The changing of the encoding is usually an undesired side-effect,
> not an intentional choice. That is to say, LANG=ja_JP was actually meant to be
> LANG=ja_JP.UTF-8 and LC_ALL=C could have been for the parsing reasons you
> brought up. If we don't do the qputenv(), we'll still setlocale() in
> QCoreApplication so qt_error_string() produces output and we'll live with the
> danger that some code undoes our choice. My search through Linux library code
> found no instance of a permanent setlocale() call with a non-null second
> parameter (Qt is actually the only exception).

Qt is a "framework", not a "library". :-)

-- 
Giuseppe D'Angelo | giuseppe.dangelo at kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
