[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Thiago Macieira thiago.macieira at intel.com
Sat Nov 16 02:47:04 CET 2019


On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > The questions are:
> > 1) do we want to prevent another library from accidentally unsetting it?
> > 2) do we want child processes to use the same?
> > 
> > Note the answers for both questions must be the same, for the solution is
> > the same. So either both yeses or both nos.
> 
> This "answers for both questions must be the same" requirement is arbitrary.
> 
> The fact that one known solution results in same answers to both is in
> no way proof that no other solutions exist.

I don't see how to prevent another library doing setlocale(LC_ALL, "") from 
overriding Qt's default other than to make setlocale(LC_ALL, "") do what 
we want. Since what it does is read the environment, the only solution is to 
change the environment.
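
To make the "read the environment" part concrete, here is a minimal sketch of the lookup order POSIX describes for the LC_CTYPE category: LC_ALL first, then LC_CTYPE, then LANG, ignoring variables that are unset or empty. The function name is hypothetical, and falling back to "C" is an assumption (POSIX leaves the default implementation-defined).

```c
#include <stdlib.h>

/* Hypothetical helper: returns the locale name that setlocale(LC_CTYPE, "")
 * would be asked to use, following the POSIX precedence
 * LC_ALL > LC_CTYPE > LANG. Unset and empty variables are skipped. */
const char *effective_ctype_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; ++i) {
        const char *value = getenv(vars[i]);
        if (value && *value)
            return value;
    }
    return "C"; /* assumption: treat the implementation default as "C" */
}
```

Any library calling setlocale(LC_ALL, "") later goes through the same lookup, which is why changing the variables is the only thing that survives such a call.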

> > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > can either deal with binary data or with UTF-8 text, there's no middle
> > ground.
> Now that's an interesting twist.
> 
> The latest memo I did (not...) get was that codecs are to be moved into a
> separate module. Which is actually ok, as it allows user code using codecs
> to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> + win".

Sure. But that's no different than using ICU or writing your own code to 
convert from binary to text. QString will not support it on its own.

> "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> definitely news to me. I've not seen this being discussed, neither here nor
> within the part of the company that I usually talk to.

You just said yourself, above. If QTextCodec moves to another library, we have 
no codecs in QtCore. That means the rest of Qt will not support other codecs.

> So when and where was this decision made, by whom, and why?
> 
> Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> codecs in some cases and did that person come to the conclusion that any
> such use is bad and deserves to die?

Probably not. Why does Qt Creator need other codecs?

> > you're arguing that there are broken applications that won't handle
> > C.UTF-8 correctly, without giving a single example.
> 
> ... is of course not true:
> 
> 1. I did not claim there were "broken" applications that won't handle
>    C.UTF-8 "correctly", I claimed that there are applications that react
>    differently to C.UTF-8.

Different behaviour is *exactly* what we want. We want this:

$ LC_ALL=C.UTF-8 ls á
ls: cannot access 'á': No such file or directory

not this:

$ LC_ALL=C ls á
ls: cannot access ''$'\303\241': No such file or directory

I thought the argument would be that despite being what we wanted, it would 
break certain scenarios. But I haven't seen any examples of breakage.

>      gcc produces different output under C and C.UTF-8:
> 
>      echo x | LC_CTYPE=C gcc -xc -
>       <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> at end of input
> 
>      echo x | LC_CTYPE=C.UTF-8 gcc -xc -
>       <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> at end of input
> 
>      As an additional twist, this different behaviour does not require fancy
> input, input is plain ASCII in both cases.
> 
>      Output parsers expecting "'" e.g. to produce a set of recommendations
> for how to quick-fix such problems in an IDE will break.

Any application that is parsing GCC output is already setting LC_ALL in the 
child process's environment. Otherwise, they'd be getting possibly translated 
messages and we all know that the order of the messages could be different. 
Not to mention that instead of "" or even “” we could see «» or „“.

Changing the environment of a child process is not going to go away.
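
As an illustration of what such a parser typically does, here is a rough sketch that pins LC_ALL=C for the child only, leaving the parent's environment alone. The function name is hypothetical, and going through popen() with a shell "export" prefix is just one assumed way to do it; a real IDE would more likely set the child's environment through its process API.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: runs `command` through the shell with LC_ALL=C
 * exported in the child only. Returns a stream to read the child's stdout
 * from, or NULL on failure. Close the stream with pclose(). */
FILE *popen_with_c_locale(const char *command)
{
    static const char prefix[] = "export LC_ALL=C; ";
    size_t size = sizeof prefix + strlen(command); /* sizeof counts the NUL */
    char *full = malloc(size);
    if (!full)
        return NULL;
    snprintf(full, size, "%s%s", prefix, command);
    FILE *pipe = popen(full, "r");
    free(full);
    return pipe;
}
```

With this, the messages the parser reads stay untranslated and use ASCII quotes regardless of what the parent process has in its environment.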

If you're telling me that you're setting the environment before the Qt 
application to cope with its brokenness, I will ask why that application 
hasn't been fixed in the 16 years since UTF-8 environments became a thing. And 
we can provide a way to force Qt not to set the environment, for those weird 
cases where you must deal with broken, proprietary cr#p that won't be fixed 
until the heat death of the Universe. And I will ask why everyone else must 
pay a performance price for the sake of those old, broken applications that 
even the maintainer isn't fixing anymore?

>      #include <locale.h>
>      #include <string.h>
>      #include <stdlib.h>
> 
>      int main()
>      {
>          if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
>              abort();
>      }
> 
>      runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

Strawman example: this doesn't happen in reality. See my exhaustive search for 
all such checks in an entire Linux distribution. I'm asking for *real* 
situations.

>      While contrived in this form, there _is_ code even in Creator checking
> for "C" literally, raising the suspicion that this might happen in other
> applications, too.

Oh, checking for "C" literally does exist; there were several in my search. 
About half of them were for *working* *around* the issue that someone set the 
locale to C but meant something else. The problem is not checking for "C", 
it's what they do with the result, and I did not find a single example of an 
application failing. The worst scenario I found is that it worked more slowly 
than otherwise. Or, put more precisely, there was an optimisation for a 
corner-case that is rarely used today.

> _I_ am not trying to talk about broken applications, I am talking about
> applications that currently do work as intended by their authors which will
> break when Qt 6 forces them to run under an LC_ALL=xxxx.UTF-8

How is that not what I said? If they don't support xxxx.UTF-8 now, which has 
been the default on Linux since 2003, how are they not broken?

> > Anyway, since you oppose setting the environment, let's just check the
> > assumption:
> > 
> > if (locale is not UTF-8)
> >     qFatal("Qt only supports UTF-8 locales. "
> >            "Please configure your system properly");
> 
> So instead of trying to understand, or even only to accept that other kids
> use their toys differently than we think they should, we'll break their
> toys. Yay.

The above is predicated on the only codec supported for QString and filenames 
being UTF-8, which was the beginning of this thread. If we're going to go back 
and revisit that decision, then there's no need to change the environment. I'm 
trying to find a solution that works in the context of the proposal of 
removing support for anything but UTF-8 in QString::fromLocal8Bit.

If we say that QString::fromLocal8Bit is UTF-8 on Unix, then either:
 a) we fix the environment when the locale says it isn't UTF-8
 b) we refuse to run when it isn't UTF-8
 c) we pretend it's UTF-8 and potentially produce and consume mojibake
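
All three options start from the same question: does the locale say UTF-8 or not? A minimal sketch of that check, using nl_langinfo(CODESET) after adopting the environment's locale, could look like this. Both function names are hypothetical, and accepting the two spellings "UTF-8" and "UTF8" case-insensitively is an assumption about what C libraries report.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: true if a codeset name denotes UTF-8.
 * Assumption: implementations report "UTF-8" or "UTF8" in some case. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Hypothetical helper: note that this changes the process's LC_CTYPE
 * locale to whatever the environment asks for, then inspects the codeset. */
int current_locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Option (a) would change the environment when current_locale_is_utf8() returns 0, option (b) would refuse to run, and option (c) would not check at all.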

> 
> Andre'
> 
> PS: There was an implied question in the preceding mails why we consider
> using LC_ALL when e.g. changing LC_CTYPE alone already is capable of doing
> all the damage.

Got it, the qputenv would be for LC_CTYPE in most cases.

>     Why do we want to use LC_ALL specifically?

If we decide we want to override the environment (case [a] above), then we 
need to set something:
 - if LC_ALL is set, then we have to override it
 - otherwise, we set/override LC_CTYPE, leaving LC_ALL unset
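
The two rules above can be sketched as follows. The function name is hypothetical, the target locale is left to the caller, and treating an empty LC_ALL the same as an unset one is an assumption that follows how POSIX treats empty locale variables.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Hypothetical helper: overrides LC_ALL if it is set (and non-empty),
 * otherwise sets LC_CTYPE and leaves LC_ALL alone. Returns the name of
 * the variable that was written, or NULL if setenv() failed. */
const char *override_locale_env(const char *utf8_locale)
{
    const char *lc_all = getenv("LC_ALL");
    const char *name = (lc_all && *lc_all) ? "LC_ALL" : "LC_CTYPE";
    return setenv(name, utf8_locale, 1) == 0 ? name : NULL;
}
```

Calling override_locale_env("C.UTF-8") before anyone calls setlocale(LC_ALL, "") would then affect both this process and any children that inherit its environment.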

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




