[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Sat Nov 16 01:23:24 CET 2019

On Thu, Nov 14, 2019 at 11:20:08PM -0800, Thiago Macieira wrote:
> On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote:
> > *Within* a Qt application consisting of Qt library, other libraries,
> > and actual user code it's mildly presumptuous for one library to impose
> > random unnecessary restrictions on user code and other libraries.
>
> That boat sailed 20 years ago when we started calling setlocale() from
> QCoreApplication. We set the locale, period.

1. I was referring to putenv, not setlocale.

2. Even for setlocale, the point is not _whether_ it is called, but _how_.
   setlocale(..., 0), for example, only queries and does not change anything.

   QCoreApplication currently calls setlocale(LC_ALL, "").

   This is fine. It accepts the user's choice of environment as authoritative.

   It also works well in practice. I can run something like

       LC_PAPER=de_LU LC_TIME=en_US.UTF-8 LC_COLLATE=C qtcreator

   and it will not only "just work" for the application itself, but
   also be properly passed on to e.g. a terminal started from within.

   So no boat has sailed, let alone 20 years ago.

   The boat _will_ sail once you put a non-empty string there,
   overriding the user's choice.
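
To make the three cases concrete, here is a minimal sketch in C. The function
names are hypothetical and only wrap the standard setlocale() call. This is
not what QCoreApplication does internally beyond the setlocale(LC_ALL, "")
call mentioned above.

```c
#include <locale.h>
#include <stddef.h>

/* Query only: a NULL (0) locale argument returns the current setting
 * for `category` and changes nothing. */
const char *current_locale(int category)
{
    return setlocale(category, NULL);
}

/* Adopt whatever the user's environment (LANG, LC_*) asks for.
 * Returns NULL if the request cannot be honored. */
const char *adopt_user_locale(void)
{
    return setlocale(LC_ALL, "");
}

/* Hypothetical override: pins one specific locale for all categories,
 * regardless of what the user configured. */
const char *force_locale(const char *name)
{
    return setlocale(LC_ALL, name);
}
```

Only the third variant discards the per-category choices made in the
environment.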

> The questions are:
> 1) do we want to prevent another library from accidentally unsetting it?
> 2) do we want child processes to use the same?
>
> Note the answers for both questions must be the same, for the solution is the
> same. So either both yeses or both nos.

This "answers for both questions must be the same" requirement is arbitrary.

The fact that one known solution results in the same answers to both is in
no way proof that no other solutions exist.

But it looks like there's no need to discuss _that_, as my answers are
"no" and "no". 

> > Making assumptions on the controllability of the content of an input stream is
> > questionable. The proposed method of changing the environment for child
> > processes is no guarantee on what the child actually produces, and the
> > Qt application still has to be prepared to handle non-Utf-8 or otherwise
> > "broken" input.
>
> Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can
> either deal with binary data or with UTF-8 text, there's no middle ground.

Now that's an interesting twist.

The latest memo I did (not...) get was that codecs are to be moved into a separate
module. Which is actually ok, as it allows user code using codecs to live
on with minimal changes, and makes QtCore slimmer, kind of "no-loss + win".

"Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
definitely news to me. I've not seen this being discussed, neither here nor
within the part of the company that I usually talk to.

So when and where was this decision made, by whom, and why?

Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
codecs in some cases and did that person come to the conclusion that any
such use is bad and deserves to die?

> > This discussion so far claimed the existence of a range of problems
> > without giving an actual example. Then it goes on to propose a shotgun
> > approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings"
> > like categories that are a bit more fine grained than LC_ALL?  Bear with
> > me when I do not have the impression that Qt will be the right context
> > to accept such "obligations".
>
> The same argument can be made for your statements:

Sure, one could do that.

But that would not make _my_ argument go away, nor compensate for the current lack
of answers to the questions I asked.

And ...

> you're arguing that there are broken applications that won't handle
> C.UTF-8 correctly, without giving a single example.

... is of course not true:

1. I did not claim there were "broken" applications that won't handle
   C.UTF-8 "correctly", I claimed that there are applications that react
   differently to C.UTF-8. 

2. I _did_ give two examples. I can repeat them here:

   2.1) https://lists.qt-project.org/pipermail/development/2019-November/037815.html

     gcc produces different output under C and C.UTF-8:

     echo x | LC_CTYPE=C gcc -xc -
      <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__' at end of input

     echo x | LC_CTYPE=C.UTF-8 gcc -xc -
      <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ at end of input

     As an additional twist, this different behaviour does not require fancy
     input, input is plain ASCII in both cases.

     Output parsers expecting "'", e.g. to produce a set of recommendations on how
     to quick-fix such problems in an IDE, will break.

   2.2) https://lists.qt-project.org/pipermail/development/2019-November/037810.html

     #include <locale.h>
     #include <string.h>
     #include <stdlib.h>

     int main()
     {   
         if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
             abort();
     }

     runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

     While contrived in this form, there _is_ code even in Creator checking
     for "C" literally, raising the suspicion that this might happen in other
     applications, too.
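
For illustration, a sketch of what a less literal check could look like. The
helper name is_c_like_locale is hypothetical, and whether treating "C.UTF-8"
like "C" is acceptable depends entirely on why the code checked for "C" in
the first place. It also does not handle the composite strings that
setlocale(LC_ALL, 0) may return when categories differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 for "C", "POSIX" and "C.<codeset>"
 * names such as "C.UTF-8", and 0 for everything else. */
int is_c_like_locale(const char *name)
{
    if (name == NULL)
        return 0;
    if (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0)
        return 1;
    /* "C." followed by a non-empty codeset name */
    return strncmp(name, "C.", 2) == 0 && name[2] != '\0';
}
```

Existing code comparing against "C" literally has no such tolerance, which is
the point of the example.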

> I think the whole problem is that we're trying to talk about broken
> applications and the way their brokenness manifests itself. I don't think such
> applications exist anymore in occurrence sufficient for us to deal with.

_I_ am not trying to talk about broken applications. I am talking about applications
that currently work as intended by their authors and that will break when Qt 6 forces
them to run under LC_ALL=xxxx.UTF-8.
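
To illustrate what adapting to the gcc example might involve, here is a
minimal sketch in C of a tokenizer that accepts both quoting styles. The
function name first_quoted_token is hypothetical, and this is not how
Creator's output parsers are written. It only shows that every consumer of
such output would need a change of this kind.

```c
#include <stddef.h>
#include <string.h>

/* Copies the first quoted token of `line` into `out` (at most outsz-1
 * bytes plus a terminating NUL). Accepts ASCII '...' as well as the
 * UTF-8 encoded pair U+2018 ... U+2019.
 * Returns 1 on success, 0 if no complete quoted token was found. */
int first_quoted_token(const char *line, char *out, size_t outsz)
{
    static const char open_utf8[]  = "\xE2\x80\x98"; /* U+2018 */
    static const char close_utf8[] = "\xE2\x80\x99"; /* U+2019 */
    const char *a = strchr(line, '\'');
    const char *u = strstr(line, open_utf8);
    const char *start, *end;

    if (outsz == 0)
        return 0;
    if (u != NULL && (a == NULL || u < a)) {
        start = u + 3;                    /* skip the 3-byte opening quote */
        end = strstr(start, close_utf8);
    } else if (a != NULL) {
        start = a + 1;
        end = strchr(start, '\'');
    } else {
        return 0;
    }
    if (end == NULL)
        return 0;

    size_t n = (size_t)(end - start);
    if (n >= outsz)
        n = outsz - 1;                    /* truncate to fit the buffer */
    memcpy(out, start, n);
    out[n] = '\0';
    return 1;
}
```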

> Anyway, since you oppose setting the environment, let's just make a check for
> assumption:
>
> if (locale is not UTF-8)
>     qFatal("Qt only supports UTF-8 locales. "
>            "Please configure your system properly");

So instead of trying to understand, or even only to accept that other kids use
their toys differently than we think they should, we'll break their toys. Yay.
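
For reference, the "locale is not UTF-8" condition in the quoted pseudocode
could be sketched roughly as follows. This assumes a POSIX system providing
nl_langinfo(CODESET). The function names are hypothetical, and the set of
accepted spellings is an assumption, not something the proposal specified.

```c
#include <langinfo.h>
#include <stddef.h>
#include <strings.h>

/* Returns 1 if `codeset` names UTF-8 (case-insensitive, with or
 * without the hyphen), 0 otherwise. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Checks the codeset of the locale currently in effect for LC_CTYPE,
 * i.e. whatever an earlier setlocale() call established. */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Note that this only inspects LC_CTYPE, which is also why the question about
LC_ALL below matters.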

Andre'

PS: There was an implied question in the preceding mails as to why we consider using
LC_ALL when e.g. changing LC_CTYPE alone is already capable of doing all the damage.

This was apparently not enough to warrant an answer.

So let me spell it out:

    Why do we want to use LC_ALL specifically?