[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Alvin Wong
alvin at alvinhc.com
Tue Mar 21 17:46:32 CET 2023
Hi,
Yes, embedding the manifest with activeCodePage set to UTF-8 is the only
thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
Qt itself should work fine now that the bug in QStringConverter has been
fixed [1]. (You can also refer to the linked mail thread [2].) However,
as that bug has shown, any code that uses `MultiByteToWideChar`
incorrectly, or wrongly assumes that `CP_ACP` always refers to a charset
in which each character is formed by no more than two bytes, will
break. Therefore, before switching to UTF-8 as
the ACP, application developers have to check their code and other
libraries to make sure everything will still work properly after the switch.
[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
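To illustrate the two-byte assumption mentioned above, here is a minimal sketch (not code from Qt or from the linked fix). The helper names `utf8SequenceLength`, `longestSequence` and `acpToWide` are made up for this example. The first two are portable and only show that a UTF-8 character can take up to four bytes; the last one is Windows-only and shows the usual way to avoid guessing a buffer size, by asking `MultiByteToWideChar` for the required length first.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start one (continuation or invalid byte).
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // already more than two bytes
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Hypothetical helper: longest sequence announced by any lead byte in `s`.
// It looks at lead bytes only and does not validate continuation bytes.
inline std::size_t longestSequence(const std::string &s)
{
    std::size_t longest = 0;
    for (unsigned char byte : s) {
        const std::size_t n = utf8SequenceLength(byte);
        if (n > longest)
            longest = n;
    }
    return longest;
}

#ifdef _WIN32
// Hypothetical helper: convert a string in the process ACP to UTF-16.
// The first call passes a null output buffer and size 0, so the API only
// reports how many wchar_t units are needed; no bytes-per-character guess.
inline std::wstring acpToWide(const std::string &in)
{
    if (in.empty())
        return std::wstring();
    assert(in.size() <= static_cast<std::size_t>(INT_MAX));
    const int length = static_cast<int>(in.size());
    const int needed = MultiByteToWideChar(CP_ACP, 0, in.data(), length, nullptr, 0);
    if (needed <= 0)
        return std::wstring();                   // conversion failed
    std::wstring out(static_cast<std::size_t>(needed), L'\0');
    const int written = MultiByteToWideChar(CP_ACP, 0, in.data(), length, &out[0], needed);
    out.resize(written > 0 ? static_cast<std::size_t>(written) : 0);
    return out;
}
#endif
```

Code that sizes its buffers this way should not care whether `CP_ACP` is a legacy double-byte code page or UTF-8; code that allocates "two bytes per character" is the kind that needs auditing before the switch.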
About the CRT, it is true that only UCRT fully supports UTF-8 locales.
When compiling with MSVC, you are almost always using UCRT so it should
be fine.
MinGW-w64 is a bit more complicated -- when one gets a mingw-w64
toolchain, the whole toolchain is already configured for a specific CRT.
Usually it will be the system MSVCRT. (If it's configured for UCRT, the
toolchain author will usually make it clear, because compiled programs
will not run out-of-the-box on Windows 8.1 or earlier.) I did not run
tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully.
mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64
toolchains that ship UCRT versions.
[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw
There are two more problems with enabling UTF-8 ACP using the manifest
that I have encountered so far. When a process is running with UTF-8
ACP, there seems to be no API available to get the native system ACP.
This can be an issue if, for example, some external tools write files
using the system ACP and your program wants to read those files. The
other problem (a mild annoyance) is that some debuggers that do not use
updated APIs (gdb, for example) do not capture `OutputDebugString`
messages in the correct encoding, which affects QDebug output.
(Console output encoding is separate from the ACP, so one might also
need to call `SetConsoleOutputCP(CP_UTF8)`, but the details are a bit
fuzzy to me.)
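For what it is worth, a rough sketch of how that call could be wrapped is below. This is untested on my side and only an assumption about a reasonable pattern: `ConsoleUtf8Guard` is a made-up class name, and the idea of restoring the previous code page on destruction is a design choice of this sketch, not something the Windows documentation requires. On non-Windows builds the class compiles to a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper: switch the console output code page to UTF-8
// for the lifetime of the object and restore the previous one afterwards.
class ConsoleUtf8Guard
{
public:
    ConsoleUtf8Guard()
    {
#ifdef _WIN32
        // GetConsoleOutputCP() returns 0 on failure, e.g. when the
        // process has no console attached.
        previous_ = GetConsoleOutputCP();
        switched_ = previous_ != 0 && SetConsoleOutputCP(CP_UTF8) != 0;
#endif
    }

    ~ConsoleUtf8Guard()
    {
#ifdef _WIN32
        if (switched_)
            SetConsoleOutputCP(previous_);
#endif
    }

    ConsoleUtf8Guard(const ConsoleUtf8Guard &) = delete;
    ConsoleUtf8Guard &operator=(const ConsoleUtf8Guard &) = delete;

    // True if this object changed the console output code page.
    bool switched() const { return switched_; }

    // Code page that was active before the switch (0 if unknown).
    unsigned int previousCodePage() const { return previous_; }

private:
    unsigned int previous_ = 0;
    bool switched_ = false;
};
```

The intended use would be to create one `ConsoleUtf8Guard` at the top of `main()` and keep it alive until the program exits. As far as I understand, the code page is a property of the console rather than of the process, so restoring it seems the polite thing to do when the console is shared with a parent shell.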
Cheers,
Alvin
On 20/3/2023 21:44, Edward Welbourne wrote:
> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>> systems on Qt 6. This does not apply to Windows because on Windows we
>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
> [0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>
> The GNU make mailing list currently has a thread (starts at [1]) about
> handling of encodings on Windows.
>
> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>
> The discussion there seems to indicate that setting the system code-page
> to UTF-8 can be done in a way that interoperates gracefully with other
> processes and the file system, presumably thanks to the system being
> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>
> The means to achieve this appear [2] to hinge on setting the active
> codepage for the application in a manifest file, that it gets combined
> with after it is linked.
>
> [2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>
> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>
> Eddy.
>