[Development] Utf8 as the 8bit encoding on Windows

Lars Knoll lars.knoll at qt.io
Mon May 4 15:46:44 CEST 2020


Hi all,

For Qt6, we wanted to close the remaining holes in our Unicode story. One of the changes that we’ve already decided upon at the Contributor Summit was to enforce UTF8 based locales on Unix systems (they’ve been the default there for the last 10-15 years).

I would really like to make this 100% cross-platform, so I sat down and did a bit of research on Windows. Things are looking pretty promising, at least since Windows 10, build 1903.

Windows actually has two 8bit code pages that you need to take care of when writing an application. There’s the application code page, which can be retrieved with the GetACP() function (1). Until build 1903, the application code page could only be changed for the system as a whole, but it can now also be changed on a per-application basis using a manifest file (2).

The console has a separate encoding that can be changed programmatically from within the app (3). The two code pages can differ from each other, making things interesting.
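
To make the distinction concrete, here is a minimal sketch of how an application could look at both code pages. GetACP(), GetConsoleCP() and GetConsoleOutputCP() are the real Win32 calls; the CodePages struct and the two helper functions are hypothetical names made up for this illustration, and the non-Windows branch only exists so the snippet compiles everywhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// The 8bit code pages a Windows process has to deal with.
struct CodePages {
    unsigned ansi;          // application (ANSI) code page, GetACP()
    unsigned consoleInput;  // console input code page, GetConsoleCP()
    unsigned consoleOutput; // console output code page, GetConsoleOutputCP()
};

// Hypothetical helper: fills `out` and returns true on Windows,
// returns false on platforms where these APIs do not exist.
bool queryCodePages(CodePages &out)
{
#ifdef _WIN32
    out.ansi = GetACP();
    out.consoleInput = GetConsoleCP();        // 0 if no console is attached
    out.consoleOutput = GetConsoleOutputCP(); // 0 if no console is attached
    return true;
#else
    (void)out;
    return false;
#endif
}

// Hypothetical helper: true only if bytes encoded for the application
// code page can go to and from the console without conversion.
bool codePagesAgree(const CodePages &cp)
{
    return cp.ansi == cp.consoleInput && cp.ansi == cp.consoleOutput;
}
```

With an ANSI code page of 1252 and a console code page of 850, codePagesAgree() returns false; with all three set to 65001 it returns true.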

I’ve been running a couple of tests using a simple test program (4) and a manifest file to change the ACP to utf8 as described in (2). Here are the results:

* On my machine, the default output code page for the console is CP850. My default ansi codepage is Windows-1252. So they don't agree, and writing to the console using toLocal8Bit() can/will lead to mojibake in some cases. As such our current handling is already broken, as we always convert to local 8bit using the application code page, also for stdin/out/err.
* Setting the application's code page to utf8 using a manifest works nicely, and GetACP() returns 65001 (UTF8) instead of 1252.
* SetConsoleOutputCP(CP_UTF8) also works reliably. Writing utf8 based data to stdout, or even using _write(1, data, len), gives the correct output on the console. This even works when writing one char at a time (i.e. incomplete utf8 sequences).

* With our current handling, none of the test strings show up correctly on the console. 
* Setting the output code page makes writing to stdout/stderr with toUtf8() work correctly. qDebug() is still not working.
* Setting the manifest in addition will also make qDebug() work correctly.
* QTextStream still delivers mojibake in all cases. I assume there’s a bug somewhere in the way we handle things in QTextStream, this needs some debugging.
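
For anyone who hasn't seen the effect, here is a small sketch of how this kind of mojibake comes about: bytes written in one encoding get interpreted in another. It uses ISO-8859-1 as a stand-in for Windows-1252 (the two agree outside 0x80-0x9F) purely to keep the code short; CP850 would produce different garbage. misreadAsLatin1() is a hypothetical helper, not a Qt or Windows API.

```cpp
#include <string>

// Interprets every byte of `bytes` as one ISO-8859-1 character and returns
// the resulting text encoded as UTF-8. Feeding it UTF-8 input shows what a
// reader sees when UTF-8 data is decoded with a single-byte code page.
std::string misreadAsLatin1(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII is identical in both encodings.
            out += static_cast<char>(b);
        } else {
            // U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

The two bytes C3 A9 (UTF-8 for "é") come out as C3 83 C2 A9, which displays as "Ã©". Pure ASCII text passes through unchanged, which is why such bugs often go unnoticed for a long time.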

Conclusions:

So to me this looks like you can get a 100% utf8/utf16 setup for your apps that is compatible with what we do on Unix, on Windows 10 build 1903 or later. To get this fully working, Qt would need to call SetConsoleOutputCP(CP_UTF8) and SetConsoleCP(CP_UTF8), and the build system would need to add a manifest file that sets the application code page to utf8.
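
A rough sketch of what the console part could look like, assuming we want to restore the previous code pages on exit the way the test program in (4) does. Utf8ConsoleGuard is a hypothetical name, not an existing Qt class, and whether Qt should restore the old values at all is an open design question. The Win32 calls are the documented ones; on other platforms the class compiles to a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper: switches the console input and output code
// pages to UTF-8 and restores the previous values on destruction.
class Utf8ConsoleGuard {
public:
    Utf8ConsoleGuard()
    {
#ifdef _WIN32
        oldOutput_ = GetConsoleOutputCP(); // 0 if no console is attached
        oldInput_ = GetConsoleCP();        // 0 if no console is attached
        const bool outOk = SetConsoleOutputCP(CP_UTF8) != 0;
        const bool inOk = SetConsoleCP(CP_UTF8) != 0;
        active_ = outOk && inOk;
#endif
    }

    ~Utf8ConsoleGuard()
    {
#ifdef _WIN32
        // Only restore values that were successfully read.
        if (oldOutput_ != 0)
            SetConsoleOutputCP(oldOutput_);
        if (oldInput_ != 0)
            SetConsoleCP(oldInput_);
#endif
    }

    Utf8ConsoleGuard(const Utf8ConsoleGuard &) = delete;
    Utf8ConsoleGuard &operator=(const Utf8ConsoleGuard &) = delete;

    // True if both console code pages were switched to UTF-8.
    bool active() const { return active_; }

private:
    unsigned oldOutput_ = 0;
    unsigned oldInput_ = 0;
    bool active_ = false;
};
```

An application would create one instance at the top of main(). Any buffered output has to be flushed before the guard goes out of scope, otherwise it reaches the console after the old code page is back in place.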

As the code page for the console is not compatible with the ansi code page, I don't see why we shouldn't change the console code pages in any case. In addition, I think we should add the manifest file to the app through the build system by default (and offer a switch to turn both the console code page and manifest handling off).

I think this would be mostly a positive change and won't break too many things for our users. Most Qt apps don't make heavy use of the 8bit APIs. If they do, they need to be prepared to handle different code pages anyway, so changing to utf8 should not break anything for them. Filenames are encoded in utf16 on NTFS, so setting the ACP to utf8 would make all files accessible by the 8bit APIs. 
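
To illustrate the filename point: a name stored as UTF-16 can always be converted to UTF-8, but only sometimes to a single-byte code page. The sketch below uses ISO-8859-1 as a simplified stand-in for a legacy ANSI code page; both function names are hypothetical, and unpaired surrogates (which are not valid UTF-16) are replaced by U+FFFD here.

```cpp
#include <cstddef>
#include <string>

// True if every UTF-16 code unit fits into ISO-8859-1 (<= 0xFF).
bool fitsInLatin1(const std::u16string &name)
{
    for (char16_t unit : name) {
        if (unit > 0xFF)
            return false;
    }
    return true;
}

// Converts UTF-16 to UTF-8. Every valid UTF-16 string has a UTF-8 form.
std::string utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate: use the replacement character
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A Greek file name fails fitsInLatin1(), so it can't be named through an 8bit API under such a code page, while utf16ToUtf8() still produces a usable byte string for it.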

Other than that I can't really think of many potential issues, as the main Windows APIs are usually the 16bit APIs, and the 8bit ones are only wrappers around those.

Comments? Am I missing something vital?

Cheers,
Lars

(1) https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
(2) https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
(3) https://docs.microsoft.com/en-us/windows/console/setconsoleoutputcp
(4)

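#include <qdebug.h>
#include <QTextStream>
#include <Windows.h>
#include <io.h>
#include <stdio.h>

int main(int, char**)
{
    // Remember the console output code page so it can be restored on exit.
    uint cp = GetConsoleOutputCP();
    qDebug() << "Console CP" << cp << GetACP();
    SetConsoleOutputCP(CP_UTF8);

    // Qt logging, then the C runtime with a UTF-8 encoded literal.
    qDebug() << "Hello" << QString::fromUtf8("Ελληνικά");
    printf("Ελληνικά\n");

    // The same bytes through stdio, as a whole string and one char at a time.
    QByteArray greek = "Ελληνικά\n";
    fprintf(stdout, "%s", greek.constData()); // never pass data as the format string
    for (char c : greek)
        fprintf(stdout, "%c", c);
    fflush(stdout); // flush stdio before bypassing it with _write() below

    // Low-level writes to file descriptor 1 (stdout), whole and per byte.
    _write(1, greek.constData(), uint(greek.length()));
    for (char c : greek)
        _write(1, &c, 1);

    // QTextStream configured for UTF-8 output.
    QTextStream ts(stdout);
    ts.setEncoding(QStringConverter::Utf8);
    ts << QString::fromUtf8(greek);
    // Flush while UTF-8 is still active; otherwise the buffered text is only
    // written by the destructor, after the code page has been restored.
    ts.flush();
    SetConsoleOutputCP(cp);
    return 0;
}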



