[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Wed Mar 22 17:35:57 CET 2023

> On 22 Mar 2023, at 12:07, Alvin Wong via Development <development at qt-project.org> wrote:
> On 22/3/2023 17:58, Lars Knoll wrote:
>> Hi,
>> 
>> 
>>> On 21 Mar 2023, at 17:46, Alvin Wong via Development <development at qt-project.org> wrote:
>>> 
>>> Hi,
>>> 
>>> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
>>> 
>>> Qt itself should work fine now that the bug in QStringConverter was fixed [1] a while back. (You can also refer to the linked mail thread. [2]) However, as this bug has shown, any code that uses `MultiByteToWideChar` incorrectly, or wrongly assumes that `CP_ACP` always refers to a charset in which each character is formed by no more than two bytes, will break. Therefore, before switching to UTF-8 as the ACP, application developers have to check their code and other libraries to make sure everything will still work properly after the switch.
>>> 
>>> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
>>> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
>>> 
>>> About the CRT, it is true that only UCRT fully supports UTF-8 locales. When compiling with MSVC, you are almost always using UCRT so it should be fine.
>>> 
>>> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, the whole toolchain is already configured for a specific CRT. Usually it will be the system MSVCRT. (If it's configured for UCRT, the toolchain author will usually make it clear, because compiled programs will not run out-of-the-box on Windows 8.1 or earlier.) I did not run tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64 toolchains that ship UCRT versions.
>>> 
>>> [3]: https://github.com/niXman/mingw-builds-binaries/releases
>>> [4]: https://github.com/mstorsjo/llvm-mingw
>>> 
>>> There are two more problems with enabling UTF-8 ACP using the manifest that I have encountered so far. When a process is running with UTF-8 ACP, there seems to be no API available to get the native system ACP. This can be an issue if, for example, some external tools write files using the system ACP and your program wants to read those files. The other problem (a mild annoyance) is that some debuggers that don't use the updated APIs (gdb, for example) do not capture `OutputDebugString` messages in the correct encoding, which affects QDebug output.
>>> 
>>> 
>> I’ve looked into that one when we did the work for Qt 6. The console has its own code page that can be set independently from the app, and I believe also independently from the system code page. qDebug() should be mostly fine, as we’re using OutputDebugStringW() internally and let Windows handle this mess.
>> 
>> What it does affect is writing to stdout/err and OutputDebugStringA(). 
>> 
> It is unfortunately a bit more messy. OutputDebugString communicates with the debugger via a debug event which contains an address, then the debugger reads the debug message from the memory space of the debuggee process.
> The documentation of OutputDebugStringW [1] states:
> "In the past, the operating system did not return Unicode strings through OutputDebugStringW (ASCII strings were returned instead). To force OutputDebugStringW to return Unicode strings, debuggers are required to call the WaitForDebugEventEx function to opt into the new behavior. In this way, the operating system knows that the debugger supports Unicode and is specifically opting into receiving Unicode strings."
> "OutputDebugStringW converts the specified string based on the current system locale information and passes it to OutputDebugStringA to be displayed. As a result, some Unicode characters may not be displayed correctly."
> What happens with a debugger that does not call `WaitForDebugEventEx` (e.g. gdb) is this: The debuggee calls OutputDebugStringW, which converts the debug string to ACP (UTF-8 in this case) to be passed to OutputDebugStringA. Then the debugger receives the event and tries to read the debug string from the debuggee as ACP, but the debugger thinks ACP is the system ACP (Windows-1252, CP950 or whatever) so it ends up displaying mojibake. The same also happens with Sysinternals DebugView.
> In reality, most of the debug messages are ASCII, so this issue rarely affects anything and I consider it just "a mild annoyance".
> [1]: https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw
>> 
>>> (Console output encoding is separate from the ACP, so one might also need to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)
>>> 
>> Setting the code page for console output should help when writing to stdout/err. It’ll require a bit of testing again (it’s been a while since I looked into it), but I believe the console was mostly handling this fine independent of the code page being used by it internally (i.e. Windows would recode the string).
>> 
>> Cheers,
>> Lars

Hi,

Here’s one patch (thanks Ilya) that we recently applied to make sure that Qt applications using UTF-8 can still exchange data with other applications that don’t:

https://codereview.qt-project.org/c/qt/qtbase/+/459275

Data exchange over the clipboard is perhaps not the only case where we encode from Qt to the native 8-bit encoding. We consistently use the ‘W’ APIs, but we use toLocal8Bit in plenty of cases as well, for instance in our Qt SQL APIs. If the DBMS expects 8-bit string data in the system’s code page, then this will fail if Qt starts writing UTF-8.
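
To illustrate that failure mode, here is a minimal, portable sketch (no Qt or Windows APIs involved). The helper names `encodeUtf8` and `decodeLatin1` are hypothetical and only stand in for "the writer emits UTF-8" and "the reader assumes a single-byte code page". Latin-1 is used as a simplified stand-in for a legacy code page such as Windows-1252, which agrees with it outside the 0x80..0x9F range.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Qt API): encode one Unicode code point as UTF-8.
std::string encodeUtf8(char32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range is a caller error
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: read bytes the way a Latin-1 consumer would,
// treating every byte as exactly one character.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out += static_cast<char32_t>(b);
    return out;
}
```

Passing U+00E9 (é) through `encodeUtf8` and then `decodeLatin1` yields two characters, U+00C3 and U+00A9, instead of one. That is the kind of corruption a legacy consumer would see if it were handed UTF-8 where it expects its own code page.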

And in a Qt application, not everything is done with Qt anyway. There might be usage of native APIs that is expected to produce or receive 8bit data in the system's code page. I don’t think we can guarantee that this will be handled transparently by the system in all cases, as it fortunately does in the clipboard case if we don’t provide UTF-8 text.
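
Application code that hands 8-bit data to such native APIs could at least detect which situation it is in. The following is only a sketch under assumptions: `isUtf8CodePage` and `processAcpIsUtf8` are hypothetical helper names, not Qt API. The only real Windows call used is `GetACP()`, and the non-Windows branch simply reports false because there is no ACP to query there.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// Code page identifier that Windows uses for UTF-8 (the value of CP_UTF8).
constexpr unsigned kUtf8CodePage = 65001;

// Hypothetical helper: true if the given code page identifier means UTF-8.
constexpr bool isUtf8CodePage(unsigned codePage)
{
    return codePage == kUtf8CodePage;
}

// Hypothetical helper: reports whether the ANSI code page of the current
// process is UTF-8, e.g. because the manifest sets activeCodePage.
bool processAcpIsUtf8()
{
#ifdef _WIN32
    return isUtf8CodePage(GetACP());
#else
    return false; // assumption: no ACP concept outside Windows
#endif
}
```

A caller could use `processAcpIsUtf8()` to decide whether 8-bit data destined for a legacy consumer needs an explicit conversion instead of relying on the process ACP. Note that this only tells you what the process uses, not what the system code page is.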

So I think defaulting a Qt application to UTF-8 is not an option on Windows. It needs to be opt-in, at least until it has become an established norm; any switch has to be done over time, with big red flags in the documentation and an easy way out for users who need to exchange data with legacy software systems.

Volker