[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Thiago Macieira thiago.macieira at intel.com
Thu Oct 31 22:11:05 CET 2019


Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move 
QTextCodec support out of QtCore)
See also: https://www.python.org/dev/peps/pep-0538/
	https://www.python.org/dev/peps/pep-0540/

Summary:
The change above, while removing QTextCodec from our API, had the side-effect 
of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be 
recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix 
systems in Qt 6. This does not apply to Windows because on Windows we cannot 
reasonably be expected to use UTF-8 for the 8-bit encoding.

There are three questions to be decided:
 a) What should Qt 6 assume the locale to be, if no locale is set?
 b) In case a non-UTF-8 locale is set, what should we do?
 c) Should we propagate our decision to child processes?

My personal preference is:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes

Long explanation:

On Unix systems, traditionally, the locale is determined by multiple 
environment variables starting with LC_ (matching macro names from 
<locale.h>), as well as the LANG and LANGUAGE variables. If none of those is 
set, the C and POSIX 
standards say that the default locale is "C". Moreover, POSIX says that the 
"POSIX" locale is "C" and does not have multibyte encodings -- that excludes 
its encoding from being UTF-8.
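
The lookup order described above can be sketched as follows. This is only an 
illustration, not code from Qt or from any C library: the function names are 
hypothetical, and the fallback to "C" follows the description in this mail. 
LANGUAGE is left out because it is a GNU gettext extension that only 
influences message translation, not the character encoding.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: pick the locale name that governs LC_CTYPE, given the
// values of the three relevant variables ("" means unset or empty).
std::string resolveCtypeLocale(const std::string &lcAll,
                               const std::string &lcCtype,
                               const std::string &lang)
{
    if (!lcAll.empty())
        return lcAll;      // LC_ALL overrides every category
    if (!lcCtype.empty())
        return lcCtype;    // then the category-specific variable
    if (!lang.empty())
        return lang;       // then the general default
    return "C";            // nothing set: fall back to "C"
}

// Same lookup, reading the variables from the real environment.
std::string ctypeLocaleFromEnvironment()
{
    auto get = [](const char *name) {
        const char *value = std::getenv(name);
        return std::string(value ? value : "");
    };
    return resolveCtypeLocale(get("LC_ALL"), get("LC_CTYPE"), get("LANG"));
}
```

In an environment where only LANG=en_US.UTF-8 is set, 
ctypeLocaleFromEnvironment() yields "en_US.UTF-8"; exporting LC_ALL=C on top 
of that makes it yield "C".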

Most modern Unix-based operating systems do set a reasonable, UTF-8-based 
locale for the user. They've been doing that for about 15 years -- it was in 
2003 that this started, when I had to switch from zsh back to bash because zsh 
didn't support UTF-8 yet; I switched back in 2005 when it gained support. On 
top of that, some even more recent Unix offerings -- namely, macOS and Android 
-- enforce that the default (or only!) locale encoding is UTF-8.

Right now, Qt faithfully accepts the locale configuration set by the user in 
the environment. It can do that because it has QTextCodec, which is also 
backed by either the libiconv routines or by ICU, so it can deal with any 
encoding. In properly-configured environments, there's no problem.

The two Python documents above (PEP-538 and 540) also discuss how Python 
changed its strategy. I'm proposing that we follow Python and go a little 
further. 

What's the problem?

The problem is where the locale is not set up properly or it is explicitly 
overridden. See PEP-538 for examples in containers, but as can be seen from it, 
Linux will default to "POSIX" or empty, which means Qt will interpret the 
locale as US-ASCII, which is almost never what is intended. Moreover, because 
of our use of QString for file names, any name that contains code units above 
0x7f will be deemed a filesystem corruption and ignored on directory listing 
-- they are not representable.
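
To make the file name problem concrete, here is a minimal sketch of the kind 
of check that a US-ASCII interpretation amounts to. The function name is 
hypothetical and this is not Qt's actual decoding code; it only shows which 
names would fail to decode.

```cpp
#include <string>

// Hypothetical check: true if every byte is a 7-bit US-ASCII code unit,
// i.e. the name can be decoded when the locale encoding is US-ASCII.
bool isPureAscii(const std::string &bytes)
{
    for (unsigned char c : bytes) {
        if (c > 0x7f)
            return false;   // not representable in US-ASCII
    }
    return true;
}
```

A name such as "report.txt" passes, while the UTF-8 bytes of "café.txt" 
(which contain 0xC3 0xA9) do not.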

Furthermore, it happens quite often that users and tools set LC_ALL to "C" in 
order to obtain messages in English, so they can be parsed by other tools or 
to be pasted in emails (every time you see me post an error message from a 
console, I've done that). There are alternative locales that can be used, like 
"C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and 
may not be actually available.

Arguing that this is an incorrect setup, while factually correct, does not 
change the fact that it happens.

Questions and options:

a) What should Qt 6 assume when the locale is unset or is just "C"?

This is the case of a simple environment where the variables are unset or have 
some legacy system-wide defaults, as well as when the user explicitly sets 
LC_ALL to "C". The options are:
 - accept them as-is
 - assume that C with UTF-8 support was intended

The first option is what we have today. And if this is our option, then 
neither question (b) nor (c) makes sense.

The second option implies doing the check in QCoreApplication right after 
setlocale(LC_ALL, ""):
   if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
      setlocale(LC_CTYPE, "C.UTF-8");
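
Since "C.UTF-8" may not be available everywhere, a slightly more defensive 
variant of the check could try the alternative names mentioned earlier in 
order. This is only a sketch under that assumption: the helper names are 
hypothetical, and it expects to be called after setlocale(LC_ALL, "").

```cpp
#include <clocale>
#include <cstring>

// True for the names that denote the plain C locale.
bool isPlainCLocale(const char *name)
{
    return name
        && (std::strcmp(name, "C") == 0 || std::strcmp(name, "POSIX") == 0);
}

// Hypothetical helper: if the process is in the plain C locale, try the
// UTF-8 candidates for LC_CTYPE in order. Returns the name the C library
// accepted, or nullptr if nothing was changed.
const char *upgradeCLocaleToUtf8()
{
    if (!isPlainCLocale(std::setlocale(LC_ALL, nullptr)))
        return nullptr;     // some other locale is active: leave it alone

    static const char *const candidates[] = { "C.UTF-8", "C.utf8", "UTF-8" };
    for (const char *candidate : candidates) {
        if (std::setlocale(LC_CTYPE, candidate))
            return candidate;   // setlocale() returns non-null on success
    }
    return nullptr;         // none of the candidates exist on this system
}
```

A null return on a system with none of these locales is the case where Qt 
would have to decide on its own how to treat 8-bit data.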

b) What should Qt 6 do if a locale other than C is set and is not UTF-8?

This case is not an accident, most of the time. It can happen from time to 
time that someone is simply testing different languages and forces LC_ALL to 
something non-default to see what happens. They'll very quickly try the UTF-8 
versions. But when it's not an accident, it means it was intended. This is the 
general state of Unix prior to 2003, when locales like "en_US", "en_GB", 
"fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro", 
"de_DE@euro", "nl_NL@euro", etc. Options are:

 - accept them as-is (this is what Python does)
 - assume that the UTF-8 variant was intended, just not properly set

The first option is what we have today, aside from the C locale (question 
(a)). However, keeping that option working implies keeping either ICU or iconv 
working in Qt 6 and we might want to get rid of that dependency for codecs.

The second option implies modifying the QCoreApplication change above. Instead 
of explicitly checking for the C locale, we'd use nl_langinfo(CODESET) to find 
out what codec the locale is expecting. If it's not UTF-8, then we'd compose a 
new LC_CTYPE locale based on what the locale was and UTF-8. That means we'd 
transform:

 ""		→ "C.UTF-8"
 "C"		→ "C.UTF-8"
 "en_US"		→ "en_US.UTF-8"
 "fr_FR@euro"	→ "fr_FR.UTF-8@euro"
 "zh_CN.GB18030"	→ "zh_CN.UTF-8"
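
The transformation above could be sketched like this, assuming locale names 
of the usual form language[_territory][.codeset][@modifier]. The function 
names are hypothetical and nothing here is existing Qt API; the nl_langinfo() 
comparison also assumes the C library spells the codeset exactly "UTF-8", 
which may not hold on every system.

```cpp
#include <langinfo.h>
#include <string>

// Hypothetical helper: rewrite a locale name so that its codeset is UTF-8,
// keeping the language, territory and any "@modifier" suffix.
std::string withUtf8Codeset(const std::string &locale)
{
    if (locale.empty() || locale == "C" || locale == "POSIX")
        return "C.UTF-8";

    std::string base = locale;
    std::string modifier;
    const std::string::size_type at = base.find('@');
    if (at != std::string::npos) {
        modifier = base.substr(at);     // e.g. "@euro", '@' included
        base.erase(at);
    }
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);                // drop the old codeset, e.g. ".GB18030"

    return base + ".UTF-8" + modifier;
}

// Asks the C library which codeset the current LC_CTYPE locale uses.
// Assumption: the UTF-8 codeset is reported as the exact string "UTF-8".
bool currentCodesetIsUtf8()
{
    return std::string(nl_langinfo(CODESET)) == "UTF-8";
}
```

QCoreApplication would then call something like 
setlocale(LC_CTYPE, withUtf8Codeset(name).c_str()) only when 
currentCodesetIsUtf8() returns false, and would still have to handle the case 
where the composed locale does not exist on the system.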

c) Should we propagate our decision to child processes?

It's not possible to propagate choices to any other processes, so the question 
applies only to child ones. Asked differently: should we set our choice in the 
application environment, so it's inherited by child processes?

Child applications written with Qt 6 would not be affected, aside from maybe a 
negligible load time improvement. But any other applications, including Qt 5 
ones, would not make the same choices. If we do not propagate, we could end up 
with child processes (often helpers) that make different choices as to what 
command-line arguments, pipes or file contents mean.

Note that we can't affect the *parent* process, so this problem could happen 
there.

Welcome side-effect: other libraries and the user's own code in the same process 
can call setlocale() after QCoreApplication has. It's possible that they, 
unknowingly, override our choices and change the C library back to an 
incorrect state. If we do set the environment, this cannot happen.

Another side-effect is that in a Qt-based graphical environment, the "right" 
choice will be propagated anyway, to all child processes.

Options are:
- yes (this is what Python does)
- no
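
For the "yes" option, propagating the choice amounts to writing it into our 
own environment with setenv() before any child is started. A minimal sketch, 
with a hypothetical function name: it assumes that overwriting LC_ALL is 
acceptable when LC_ALL is what forced the old encoding, since LC_ALL would 
otherwise keep overriding LC_CTYPE in the child.

```cpp
#include <stdlib.h>   // getenv, setenv (POSIX)
#include <string>

// Hypothetical helper: record the chosen locale in the process environment
// so that child processes inherit it. Returns true on success.
bool exportLocaleChoice(const std::string &locale)
{
    // If LC_ALL is set and non-empty it overrides LC_CTYPE, so the choice
    // has to go there; otherwise touching LC_CTYPE alone is enough.
    const char *lcAll = getenv("LC_ALL");
    const char *variable = (lcAll && *lcAll) ? "LC_ALL" : "LC_CTYPE";
    return setenv(variable, locale.c_str(), 1) == 0;
}
```

Note that overwriting LC_ALL changes every category for the child, not just 
the encoding. That is harmless when going from "C" to "C.UTF-8", but it is a 
design choice that would need to be confirmed for other locales.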

Bonus d) should we print a warning when we've made a change?

Options are:
- yes, for all of them
- yes, but only for locales other than "C"
- no

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




