[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Lars Knoll lars.knoll at qt.io
Fri Nov 1 10:21:48 CET 2019

Hi Thiago,

Thanks for the comprehensive mail.

> On 31 Oct 2019, at 22:11, Thiago Macieira <thiago.macieira at intel.com> wrote:
> Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move 
> QTextCodec support out of QtCore)
> See also: https://www.python.org/dev/peps/pep-0538/
> 	https://www.python.org/dev/peps/pep-0540/
> Summary:
> The change above, while removing QTextCodec from our API, had the side-effect 
> of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be 
> recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix 
> systems on Qt 6. This does not apply to Windows because on Windows we cannot 
> reasonably be expected to use UTF-8 for the 8-bit encoding.

I do not think we have to worry about the local 8-bit encoding on Windows these days. All our interaction with the OS goes through the 16-bit APIs (i.e. uses UTF-16). I don’t think file content is a huge issue anymore either, as Windows 10 seems to have added UTF-8 support to most of its tools.

AFAIK, we can also use a Unicode API for console and debug output, so the only piece that’s left might be our users interacting with legacy ANSI APIs. That should be a rare case, and it should be straightforward to port such code over to the Unicode API instead.

> There are three questions to be decided:
> a) What should Qt 6 assume the locale to be, if no locale is set?
> b) In case a non-UTF-8 locale is set, what should we do?
> c) Should we propagate our decision to child processes?
> My personal preference is:
> a) C.UTF-8
> b) override it to force UTF-8 on the same locale
> c) yes

I agree with all three choices. For your bonus (d) below, I’d say we should print a warning if we encounter a non-UTF-8 locale other than C.
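
To make that warning rule concrete, here is a minimal sketch of the decision as a predicate. The function name shouldWarnAboutOverride is hypothetical and not an existing Qt API, and treating an empty name and "POSIX" the same as "C" is an assumption on top of what is written above.

```cpp
#include <string>

// Hypothetical policy helper, not a Qt API.
// originalLocale: the LC_CTYPE locale name found at startup.
// codesetWasUtf8: whether that locale's codeset was already UTF-8.
// Returns true only when an override happened on a locale other than C.
bool shouldWarnAboutOverride(const std::string &originalLocale, bool codesetWasUtf8)
{
    if (codesetWasUtf8)
        return false; // nothing was overridden, so there is nothing to report

    // Assumption: "", "C" and "POSIX" all count as "the C locale" here.
    const bool isCLocale = originalLocale.empty()
                        || originalLocale == "C"
                        || originalLocale == "POSIX";
    return !isCLocale;
}
```

With this rule, a silent upgrade happens for LC_ALL=C, while something like "en_US" without a UTF-8 codeset would produce a message.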


> Long explanation:
> On Unix systems, traditionally, the locale is a factor of multiple environment 
> variables starting with LC_ (matching macro names from <locale.h>), as well as 
> the LANG and LANGUAGE variables. If none of those is set, the C and POSIX 
> standards say that the default locale is "C". Moreover, POSIX says that the 
> "POSIX" locale is "C" and does not have multibyte encodings -- that excludes 
> its encoding from being UTF-8.
> Most modern Unix-based operating systems do set a reasonable, UTF8-based 
> locale for the user. They've been doing that for about 15 years -- it was in 
> 2003 that this started, when I had to switch from zsh back to bash because zsh 
> didn't support UTF-8 yet, but switched back in 2005 when it gained support. On 
> top of that, some even more recent Unix offerings -- namely, macOS and Android 
> -- enforce that the default (or only!) locale encoding is UTF-8.
> Right now, Qt faithfully accepts the locale configuration set by the user in 
> the environment. It can do that because it has QTextCodec, which is also 
> backed by either the libiconv routines or by ICU, so it can deal with any 
> encoding. In properly-configured environments, there's no problem.
> The two Python documents above (PEP-538 and 540) also discuss how Python 
> changed its strategy. I'm proposing that we follow Python and go a little 
> further. 
> What's the problem?
> The problem is where the locale is not set up properly or it is explicitly 
> overridden. See PEP-538 for examples in containers, but as can be seen from it, 
> Linux will default to "POSIX" or empty, which means Qt will interpret the 
> locale as US-ASCII, which is almost never what is intended. Moreover, because 
> of our use of QString for file names, any name that contains code units above 
> 0x7f will be deemed a filesystem corruption and ignored on directory listing 
> -- they are not representable.
> Furthermore, it happens quite often that users and tools set LC_ALL to "C" in 
> order to obtain messages in English, so they can be parsed by other tools or 
> to be pasted in emails (every time you see me post an error message from a 
> console, I've done that). There are alternative locales that can be used, like 
> "C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and 
> may not be actually available.
> Arguing that this is an incorrect setup, while factually correct, does not 
> change the fact that it happens.
> Questions and options:
> a) What should Qt 6 assume when the locale is unset or is just "C"?
> This is the case of a simple environment where the variables are unset or have 
> some legacy system-wide defaults, as well as when the user explicitly sets 
> LC_ALL to "C". The options are:
> - accept them as-is
> - assume that C with UTF-8 support was intended
> The first option is what we have today. And if this is our option, then 
> neither question b nor c makes sense.
> The second option implies doing the check in QCoreApplication right after 
> setlocale(LC_ALL, ""):
>   if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
>      setlocale(LC_CTYPE, "C.UTF-8");
> b) What should Qt 6 do if a different locale, other than C, is non-UTF8?
> This case is not an accident, most of the time. It can happen from time to 
> time that someone is simply testing different languages and forces LC_ALL to 
> something non-default to see what happens. They'll very quickly try the UTF-8 
> versions. But when it's not an accident, it means it was intended. This is the 
> general state of Unix prior to 2003, when locales like "en_US", "en_GB", 
> "fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro", 
> "de_DE@euro", "nl_NL@euro", etc. Options are:
> - accept them as-is (this is what Python does)
> - assume that the UTF-8 variant was intended, just not properly set
> The first option is what we have today, aside from the C locale (question 
> (a)). However, keeping that option working implies keeping either ICU or iconv 
> working in Qt 6 and we might want to get rid of that dependency for codecs.
> The second option implies modifying the QCoreApplication change above. Instead 
> of explicitly checking for the C locale, we'd use nl_langinfo(CODESET) to find 
> out what codec the locale is expecting. If it's not UTF-8, then we'd compose a 
> new LC_CTYPE locale based on what the locale was and UTF-8. That means we'd 
> transform:
> ""		→ "C.UTF-8"
> "C"		→ "C.UTF-8"
> "en_US"		→ "en_US.UTF-8"
> "fr_FR@euro"	→ "fr_FR.UTF-8@euro"
> "zh_CN.GB18030"	→ "zh_CN.UTF-8"
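
For illustration only, the nl_langinfo(CODESET) check and the name transformation in the table above could be sketched roughly as follows. The helper names codesetIsUtf8, toUtf8LocaleName and forceUtf8Ctype are made up for this mail and are not Qt API. The sketch assumes a POSIX system, assumes that "POSIX" should be handled like "C", and assumes the composed ".UTF-8" locale is actually installed, which is not guaranteed.

```cpp
#include <langinfo.h>  // nl_langinfo, CODESET (POSIX)
#include <locale.h>    // setlocale
#include <strings.h>   // strcasecmp (POSIX)
#include <string>

// Hypothetical helper: true if the codeset name denotes UTF-8.
// Systems spell it "UTF-8" or "utf8", so compare case-insensitively.
bool codesetIsUtf8(const char *codeset)
{
    return codeset
        && (strcasecmp(codeset, "UTF-8") == 0 || strcasecmp(codeset, "UTF8") == 0);
}

// Hypothetical helper: rewrite "language[_territory][.codeset][@modifier]"
// so that the codeset part is UTF-8, keeping any @modifier at the end.
std::string toUtf8LocaleName(const std::string &name)
{
    if (name.empty() || name == "C" || name == "POSIX")
        return "C.UTF-8";

    std::string base = name;
    std::string modifier;
    const std::string::size_type at = base.find('@');
    if (at != std::string::npos) {
        modifier = base.substr(at); // keeps the leading '@'
        base.erase(at);
    }
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);            // drop the old codeset, if any
    return base + ".UTF-8" + modifier;
}

// Hypothetical helper: to be called after setlocale(LC_ALL, "").
// Returns false if the C library does not know the composed locale.
bool forceUtf8Ctype()
{
    if (codesetIsUtf8(nl_langinfo(CODESET)))
        return true; // already UTF-8, nothing to do
    const char *current = setlocale(LC_CTYPE, nullptr);
    const std::string wanted = toUtf8LocaleName(current ? current : "");
    return setlocale(LC_CTYPE, wanted.c_str()) != nullptr;
}
```

Keeping the name rewriting separate from the setlocale() call makes the mapping easy to unit-test without depending on which locales a given machine has installed.
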
> c) Should we propagate our decision to child processes?
> It's not possible to propagate choices to any other processes, so the question 
> is only to child ones. Asked differently: should we set our choice in the 
> application environment, so it's inherited by child processes?
> Child applications written with Qt 6 would not be affected, aside from maybe a 
> negligible load time improvement. But any other applications, including Qt 5 
> ones, would not make the same choices. If we do not propagate, we could end up 
> with child processes (often helpers) that make different choices as to what 
> command-line arguments or pipes or contents in files mean.
> Note that we can't affect the *parent* process, so this problem could happen 
> there.
> Welcome side-effect: other libraries and user's own code in the same process 
> can call setlocale() after QCoreApplication has. It's possible that they, 
> unknowingly, override our choices and change the C library back to an 
> incorrect state. If we do set the environment, this cannot happen.
> Another side-effect is that in a Qt-based graphical environment, the "right" 
> choice will be propagated anyway, to all child processes.
> Options are:
> - yes (this is what Python does)
> - no
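
As a rough sketch of what "yes" could look like: export the chosen locale into our own environment so that child processes inherit it. The function name exportCtypeLocale is hypothetical. Which variable to set is also an assumption on my side: since LC_ALL takes precedence over LC_CTYPE, this sketch overwrites LC_ALL when it is already set and otherwise sets only LC_CTYPE.

```cpp
#include <stdlib.h>  // getenv, setenv (POSIX)
#include <string>

// Hypothetical helper, not a Qt API: store the UTF-8 locale name in the
// process environment so it is inherited by child processes.
bool exportCtypeLocale(const std::string &utf8Locale)
{
    // LC_ALL overrides every LC_* category, so setting only LC_CTYPE
    // would have no effect in children while LC_ALL is present.
    const char *all = getenv("LC_ALL");
    const char *variable = (all && *all) ? "LC_ALL" : "LC_CTYPE";
    return setenv(variable, utf8Locale.c_str(), 1 /* overwrite */) == 0;
}
```

Because later setlocale(LC_ALL, "") calls in the same process read the environment again, this also covers the "welcome side-effect" described above.
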
> Bonus d) should we print a warning when we've made a change?
> Options are:
> - yes, for all of them
> - yes, but only for locales other than "C"
> - no
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel System Software Products
