[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira
thiago.macieira at intel.com
Thu Oct 31 22:11:05 CET 2019
Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
QTextCodec support out of QtCore)
See also: https://www.python.org/dev/peps/pep-0538/
https://www.python.org/dev/peps/pep-0540/
Summary:
The change above, while removing QTextCodec from our API, had the side-effect
of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be
recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix
systems in Qt 6. This does not apply to Windows because on Windows we cannot
reasonably be expected to use UTF-8 for the 8-bit encoding.
There are three questions to be decided:
a) What should Qt 6 assume the locale to be, if no locale is set?
b) In case a non-UTF-8 locale is set, what should we do?
c) Should we propagate our decision to child processes?
My personal preference is:
a) C.UTF-8
b) override it to force UTF-8 on the same locale
c) yes
Long explanation:
On Unix systems, traditionally, the locale is determined by several environment
variables starting with LC_ (matching macro names from <locale.h>), as well as
the LANG and LANGUAGE variables. If none of those is set, the C and POSIX
standards say that the default locale is "C". Moreover, POSIX says that the
"POSIX" locale is "C" and does not have multibyte encodings -- that excludes
its encoding from being UTF-8.
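
As a rough illustration of that lookup for the character-encoding category
(LC_CTYPE), here is a minimal sketch. The function name effective_ctype is
made up for this example; the precedence shown (LC_ALL, then LC_CTYPE, then
LANG, with empty values treated as unset) is the POSIX rule, and LANGUAGE is
left out because it only affects message translation:

```c
#include <stddef.h>

/* Returns the name that governs LC_CTYPE, given the values of the three
 * relevant environment variables (NULL means "not set").
 * LC_ALL beats LC_CTYPE, which beats LANG; empty values count as unset. */
static const char *effective_ctype(const char *lc_all,
                                   const char *lc_ctype,
                                   const char *lang)
{
    if (lc_all && *lc_all)
        return lc_all;
    if (lc_ctype && *lc_ctype)
        return lc_ctype;
    if (lang && *lang)
        return lang;
    return "C";    /* nothing set: the C/POSIX default applies */
}
```

A caller would pass getenv("LC_ALL"), getenv("LC_CTYPE") and getenv("LANG").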
Most modern Unix-based operating systems do set a reasonable, UTF8-based
locale for the user. They've been doing that for about 15 years -- it was in
2003 that this started, when I had to switch from zsh back to bash because zsh
didn't support UTF-8 yet, but switched back in 2005 when it gained support. On
top of that, some even more recent Unix offerings -- namely, macOS and Android
-- enforce that the default (or only!) locale encoding is UTF-8.
Right now, Qt faithfully accepts the locale configuration set by the user in
the environment. It can do that because it has QTextCodec, which is also
backed by either the libiconv routines or by ICU, so it can deal with any
encoding. In properly-configured environments, there's no problem.
The two Python documents above (PEP-538 and 540) also discuss how Python
changed its strategy. I'm proposing that we follow Python and go a little
further.
What's the problem?
The problem is where the locale is not set up properly or it is explicitly
overridden. See PEP-538 for examples in containers, but as can be seen from it,
Linux will default to "POSIX" or empty, which means Qt will interpret the
locale as US-ASCII, which is almost never what is intended. Moreover, because
of our use of QString for file names, any name that contains code units above
0x7f will be deemed a filesystem corruption and ignored on directory listing
-- they are not representable.
Furthermore, it happens quite often that users and tools set LC_ALL to "C" in
order to obtain messages in English, so they can be parsed by other tools or
to be pasted in emails (every time you see me post an error message from a
console, I've done that). There are alternative locales that can be used, like
"C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and
may not be actually available.
Arguing that this is an incorrect setup, while factually correct, does not
change the fact that it happens.
Questions and options:
a) What should Qt 6 assume when the locale is unset or is just "C"?
This is the case of a simple environment where the variables are unset or have
some legacy system-wide defaults, as well as when the user explicitly sets
LC_ALL to "C". The options are:
- accept them as-is
- assume that C with UTF-8 support was intended
The first option is what we have today. And if this is our option, then
neither question (b) nor (c) makes sense.
The second option implies doing the check in QCoreApplication right after
setlocale(LC_ALL, ""):
    if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
        setlocale(LC_CTYPE, "C.UTF-8");
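
Since "C.UTF-8" may not exist on every system, a slightly more defensive
variant could try the alternative names mentioned earlier in turn. This is
only a sketch: the helper name coerce_c_locale_to_utf8 is hypothetical, it
inspects LC_CTYPE rather than LC_ALL, and treating "POSIX" like "C" is an
assumption on my part:

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: if LC_CTYPE resolved to plain "C" (or its alias
 * "POSIX"), try to switch it to a UTF-8 variant. Returns the name that
 * setlocale() accepted, or NULL if the locale was left untouched. */
static const char *coerce_c_locale_to_utf8(void)
{
    static const char *const candidates[] = { "C.UTF-8", "C.utf8", "UTF-8" };
    const char *current = setlocale(LC_CTYPE, NULL);
    size_t i;

    if (!current || (strcmp(current, "C") != 0 && strcmp(current, "POSIX") != 0))
        return NULL;        /* some other locale is in effect; keep it */

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (setlocale(LC_CTYPE, candidates[i]))
            return candidates[i];
    }
    return NULL;            /* no UTF-8 flavour of the C locale is installed */
}
```

It would be called once, right after setlocale(LC_ALL, ""). What to do when
it returns NULL on a system without any of those locales is left open here.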
b) What should Qt 6 do if a locale other than C is set and is not UTF-8?
This case is not an accident, most of the time. It can happen from time to
time that someone is simply testing different languages and forces LC_ALL to
something non-default to see what happens. They'll very quickly try the UTF-8
versions. But when it's not an accident, it means it was intended. This is the
general state of Unix prior to 2003, when locales like "en_US", "en_GB",
"fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro",
"de_DE@euro", "nl_NL@euro", etc. Options are:
- accept them as-is (this is what Python does)
- assume that the UTF-8 variant was intended, just not properly set
The first option is what we have today, aside from the C locale (question
(a)). However, keeping that option working implies keeping either ICU or iconv
working in Qt 6 and we might want to get rid of that dependency for codecs.
The second option implies modifying the QCoreApplication change above. Instead
of explicitly checking for the C locale, we'd use nl_langinfo(CODESET) to find
out what codec the locale is expecting. If it's not UTF-8, then we'd compose a
new LC_CTYPE locale based on what the locale was and UTF-8. That means we'd
transform:
    ""              → "C.UTF-8"
    "C"             → "C.UTF-8"
    "en_US"         → "en_US.UTF-8"
    "fr_FR@euro"    → "fr_FR.UTF-8@euro"
    "zh_CN.GB18030" → "zh_CN.UTF-8"
c) Should we propagate our decision to child processes?
It's not possible to propagate choices to any other processes, so the question
is only to child ones. Asked differently: should we set our choice in the
application environment, so it's inherited by child processes?
Child applications written with Qt 6 would not be affected, aside from maybe a
negligible load time improvement. But any other applications, including Qt 5
ones, would not make the same choices. If we do not propagate, we could end up
with child processes (often helpers) that make different choices as to what
command-line arguments or pipes or contents in files mean.
Note that we can't affect the *parent* process, so this problem could happen
there.
Welcome side-effect: other libraries and the user's own code in the same
process can call setlocale() after QCoreApplication has. It's possible that
they, unknowingly, override our choices and change the C library back to an
incorrect state. If we do set the environment, a later setlocale(LC_ALL, "")
picks up our choice instead of undoing it.
Another side-effect is that in a Qt-based graphical environment, the "right"
choice will be propagated anyway, to all child processes.
Options are:
- yes (this is what Python does)
- no
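
For the "yes" option, propagation amounts to writing the chosen name into the
process environment. A minimal sketch, assuming POSIX setenv(). The helper
name export_ctype_choice is hypothetical, and overwriting a non-empty LC_ALL
(which would otherwise take precedence over LC_CTYPE in the child) is one
possible policy, not something decided in this RFC:

```c
#include <stdlib.h>

/* Hypothetical helper: export the chosen LC_CTYPE name so that child
 * processes inherit it. Returns 0 on success, -1 on failure. */
static int export_ctype_choice(const char *name)
{
    const char *lc_all = getenv("LC_ALL");

    /* Assumption: a non-empty LC_ALL would hide LC_CTYPE from the child,
     * so it is replaced by the UTF-8 variant as well. */
    if (lc_all && *lc_all) {
        if (setenv("LC_ALL", name, 1) != 0)
            return -1;
    }
    return setenv("LC_CTYPE", name, 1);
}
```

This has to run before any child process is started, since a child takes its
copy of the environment when it is spawned.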
Bonus d) should we print a warning when we've made a change?
Options are:
- yes, for all of them
- yes, but only for locales other than "C"
- no
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products