[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Lars Knoll
lars.knoll at qt.io
Fri Nov 1 10:21:48 CET 2019
Hi Thiago,
Thanks for the comprehensive mail.
> On 31 Oct 2019, at 22:11, Thiago Macieira <thiago.macieira at intel.com> wrote:
>
> Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> QTextCodec support out of QtCore)
> See also: https://www.python.org/dev/peps/pep-0538/
> https://www.python.org/dev/peps/pep-0540/
>
> Summary:
> The change above, while removing QTextCodec from our API, had the side-effect
> of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be
> recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix
> systems on Qt 6. This does not apply to Windows because on Windows we cannot
> reasonably be expected to use UTF-8 for the 8-bit encoding.
I do not think we have to worry about the local 8-bit encoding on Windows anymore these days. All our interaction with the OS goes through the 16-bit APIs (i.e. uses UTF-16). I don’t think file content is a huge issue anymore either, as Windows 10 seems to have added UTF-8 support to most of its tools.
Afaik, we can also use a Unicode API for console and debug output, so the only piece that’s left might be our users interacting with legacy ANSI APIs. That should be a rare case and it should be straightforward to port that over to use the Unicode API instead.
>
> There are three questions to be decided:
> a) What should Qt 6 assume the locale to be, if no locale is set?
> b) In case a non-UTF-8 locale is set, what should we do?
> c) Should we propagate our decision to child processes?
>
> My personal preference is:
> a) C.UTF-8
> b) override it to force UTF-8 on the same locale
> c) yes
I agree with all three choices. For your bonus (d) below, I’d say we should print a warning if we encounter a non-UTF-8 locale other than C.
Cheers,
Lars
>
> Long explanation:
>
> On Unix systems, traditionally, the locale is a factor of multiple environment
> variables starting with LC_ (matching macro names from <locale.h>), as well as
> the LANG and LANGUAGE variables. If none of those is set, the C and POSIX
> standards say that the default locale is "C". Moreover, POSIX says that the
> "POSIX" locale is "C" and does not have multibyte encodings -- that excludes
> its encoding from being UTF-8.
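
A minimal sketch of the lookup order described above, assuming the usual POSIX precedence for the character-type category (LC_ALL, then LC_CTYPE, then LANG, then "C"). pick_ctype_name() and effective_ctype_name() are hypothetical helper names used only for illustration; they are not Qt or C library functions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: picks the name that governs LC_CTYPE from the three
 * candidate values, in the precedence order LC_ALL > LC_CTYPE > LANG.
 * Unset (NULL) and empty values are skipped; "C" is the final fallback. */
static const char *pick_ctype_name(const char *lc_all, const char *lc_ctype,
                                   const char *lang)
{
    const char *const candidates[] = { lc_all, lc_ctype, lang };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (candidates[i] && candidates[i][0] != '\0')
            return candidates[i];
    }
    return "C";
}

/* Convenience wrapper that reads the process environment. */
static const char *effective_ctype_name(void)
{
    return pick_ctype_name(getenv("LC_ALL"), getenv("LC_CTYPE"), getenv("LANG"));
}
```

With nothing set, the result is "C", which is the case question (a) is about.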
>
> Most modern Unix-based operating systems do set a reasonable, UTF8-based
> locale for the user. They've been doing that for about 15 years -- it was in
> 2003 that this started, when I had to switch from zsh back to bash because zsh
> didn't support UTF-8 yet, but switched back in 2005 when it gained support. On
> top of that, some even more recent Unix offerings -- namely, macOS and Android
> -- enforce that the default (or only!) locale encoding is UTF-8.
>
> Right now, Qt faithfully accepts the locale configuration set by the user in
> the environment. It can do that because it has QTextCodec, which is also
> backed by either the libiconv routines or by ICU, so it can deal with any
> encoding. In properly-configured environments, there's no problem.
>
> The two Python documents above (PEP-538 and 540) also discuss how Python
> changed its strategy. I'm proposing that we follow Python and go a little
> further.
>
> What's the problem?
>
> The problem is where the locale is not set up properly or it is explicitly
> overridden. See PEP-538 for examples in containers, but as can be seen from it,
> Linux will default to "POSIX" or empty, which means Qt will interpret the
> locale as US-ASCII, which is almost never what is intended. Moreover, because
> of our use of QString for file names, any name that contains code units above
> 0x7f will be deemed a filesystem corruption and ignored on directory listing
> -- they are not representable.
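
To illustrate the file name problem: under a US-ASCII interpretation, any byte above 0x7f has no valid decoding. The check below is only a sketch of that condition, not Qt's actual decoding path; is_representable_in_ascii() is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if every byte of a NUL-terminated
 * file name lies in the US-ASCII range (0x00..0x7f). */
static bool is_representable_in_ascii(const char *name)
{
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p) {
        if (*p > 0x7f)
            return false;
    }
    return true;
}
```

A name such as "readme.txt" passes, while the UTF-8 bytes of "café.txt" (which include 0xc3 0xa9) do not.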
>
> Furthermore, it happens quite often that users and tools set LC_ALL to "C" in
> order to obtain messages in English, so they can be parsed by other tools or
> to be pasted in emails (every time you see me post an error message from a
> console, I've done that). There are alternative locales that can be used, like
> "C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and
> may not be actually available.
>
> Arguing that this is an incorrect setup, while factually correct, does not
> change the fact that it happens.
>
> Questions and options:
>
> a) What should Qt 6 assume when the locale is unset or is just "C"?
>
> This is the case of a simple environment where the variables are unset or have
> some legacy system-wide defaults, as well as when the user explicitly sets
> LC_ALL to "C". The options are:
> - accept them as-is
> - assume that C with UTF-8 support was intended
>
> The first option is what we have today. And if this is our option, then
> neither question b nor c makes sense.
>
> The second option implies doing the check in QCoreApplication right after
> setlocale(LC_ALL, ""):
>     if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
>         setlocale(LC_CTYPE, "C.UTF-8");
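
A slightly fuller sketch of that check, with two assumptions that are not part of the proposal: "POSIX" is treated the same as "C", and the alternative spellings mentioned earlier ("C.utf8", "UTF-8") are tried when "C.UTF-8" is not available on the system. The function name is hypothetical.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: adopts the environment's locale, then upgrades a
 * plain "C"/"POSIX" result to a UTF-8 character type if the system offers
 * one. Returns the LC_CTYPE locale name in effect afterwards. */
static const char *init_locale_prefer_utf8_for_c(void)
{
    /* Adopt whatever the environment requests; on failure the C library
     * keeps the startup "C" locale. */
    setlocale(LC_ALL, "");

    const char *current = setlocale(LC_ALL, NULL);
    if (current && (strcmp(current, "C") == 0 || strcmp(current, "POSIX") == 0)) {
        /* Spellings differ between operating systems; try each in turn. */
        static const char *const candidates[] = { "C.UTF-8", "C.utf8", "UTF-8" };
        for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
            if (setlocale(LC_CTYPE, candidates[i]) != NULL)
                break;
        }
    }
    return setlocale(LC_CTYPE, NULL);
}
```

If none of the candidates exists, LC_CTYPE simply stays at "C", which is today's behaviour.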
>
> b) What should Qt 6 do if a locale other than C is set and it is non-UTF-8?
>
> This case is not an accident, most of the time. It can happen from time to
> time that someone is simply testing different languages and forces LC_ALL to
> something non-default to see what happens. They'll very quickly try the UTF-8
> versions. But when it's not an accident, it means it was intended. This is the
> general state of Unix prior to 2003, when locales like "en_US", "en_GB",
> "fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro",
> "de_DE@euro", "nl_NL@euro", etc. Options are:
>
> - accept them as-is (this is what Python does)
> - assume that the UTF-8 variant was intended, just not properly set
>
> The first option is what we have today, aside from the C locale (question
> (a)). However, keeping that option working implies keeping either ICU or iconv
> working in Qt 6 and we might want to get rid of that dependency for codecs.
>
> The second option implies modifying the QCoreApplication change above. Instead
> of explicitly checking for the C locale, we'd use nl_langinfo(CODESET) to find
> out what codec the locale is expecting. If it's not UTF-8, then we'd compose a
> new LC_CTYPE locale based on what the locale was and UTF-8. That means we'd
> transform:
>
>     ""              → "C.UTF-8"
>     "C"             → "C.UTF-8"
>     "en_US"         → "en_US.UTF-8"
>     "fr_FR@euro"    → "fr_FR.UTF-8@euro"
>     "zh_CN.GB18030" → "zh_CN.UTF-8"
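
The transformation in the table above could be sketched as follows, assuming locale names of the form language_TERRITORY[.codeset][@modifier]. Both function names are hypothetical, treating "POSIX" like "C" is an added assumption, and the exact string returned by nl_langinfo(CODESET) for UTF-8 may be spelled differently on some platforms.

```c
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if the current LC_CTYPE codeset is UTF-8.
 * Assumes the C library spells the codeset exactly "UTF-8". */
static int codeset_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Hypothetical helper: rewrites a locale name so that its codeset is UTF-8,
 * keeping the language/territory part and any @modifier.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int compose_utf8_locale(const char *name, char *out, size_t out_size)
{
    if (!name || !*name || strcmp(name, "POSIX") == 0)
        name = "C";

    const char *modifier = strchr(name, '@');   /* NULL when there is none */
    size_t base_len = strcspn(name, ".@");      /* part before codeset/modifier */

    int n = snprintf(out, out_size, "%.*s.UTF-8%s",
                     (int)base_len, name, modifier ? modifier : "");
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A caller would only invoke compose_utf8_locale() when codeset_is_utf8() reports false, then pass the result to setlocale(LC_CTYPE, ...) and check the return value, since the composed locale may not be installed.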
>
> c) Should we propagate our decision to child processes?
>
> It's not possible to propagate choices to any other processes, so the question
> is only to child ones. Asked differently: should we set our choice in the
> application environment, so it's inherited by child processes?
>
> Child applications written with Qt 6 would not be affected, aside from maybe a
> negligible load time improvement. But any other applications, including Qt 5
> ones, would not make the same choices. If we do not propagate, we could end up
> with child processes (often helpers) that make different choices as to what
> command-line arguments, pipes, or file contents mean.
>
> Note that we can't affect the *parent* process, so this problem could happen
> there.
>
> Welcome side-effect: other libraries and user's own code in the same process
> can call setlocale() after QCoreApplication has. It's possible that they,
> unknowingly, override our choices and change the C library back to an
> incorrect state. If we do set the environment, this cannot happen.
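
A sketch of what setting the choice in the environment might look like. Which variable to write is an assumption here, not something the proposal specifies: because LC_ALL overrides every LC_* variable, this version rewrites LC_ALL when it is set and writes LC_CTYPE otherwise. export_utf8_choice() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <stdlib.h>

/* Hypothetical helper: records the chosen UTF-8 locale in the process
 * environment so that child processes and later setlocale(LC_ALL, "")
 * calls see it. Returns 0 on success, -1 on failure (as setenv() does). */
static int export_utf8_choice(const char *utf8_locale)
{
    const char *lc_all = getenv("LC_ALL");

    /* A non-empty LC_ALL would mask LC_CTYPE, so it has to be the one
     * that is rewritten in that case. */
    const char *variable = (lc_all && *lc_all) ? "LC_ALL" : "LC_CTYPE";
    return setenv(variable, utf8_locale, 1);
}
```

With LC_ALL=en_US in the environment, passing "en_US.UTF-8" leaves the language of the other categories unchanged while switching the encoding.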
>
> Another side-effect is that in a Qt-based graphical environment, the "right"
> choice will be propagated anyway, to all child processes.
>
> Options are:
> - yes (this is what Python does)
> - no
>
> Bonus d) should we print a warning when we've made a change?
>
> Options are:
> - yes, for all of them
> - yes, but only for locales other than "C"
> - no
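
The middle option (warn only for locales other than "C") could be sketched like this. The helper name and the message wording are hypothetical, and treating the empty name and "POSIX" as equivalent to "C" is an assumption.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints a warning to stderr when a non-C locale was
 * overridden, and stays silent for "", "C" and "POSIX".
 * Returns true if a warning was printed. */
static bool warn_if_overridden(const char *original, const char *replacement)
{
    if (!original || !*original ||
        strcmp(original, "C") == 0 || strcmp(original, "POSIX") == 0)
        return false;

    fprintf(stderr, "Locale \"%s\" is not UTF-8; using \"%s\" instead\n",
            original, replacement);
    return true;
}
```

The silent path covers the common LC_ALL=C case, where a warning on every run would mostly be noise.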
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
> Software Architect - Intel System Software Products