[Development] Utf8 as the 8bit encoding on Windows

Thiago Macieira thiago.macieira at intel.com
Mon May 11 17:08:04 CEST 2020


On Monday, 4 May 2020 06:46:44 PDT Lars Knoll wrote:
> Hi all,

Hello Lars

> I would really like to make this 100% cross-platform, so I sat down and did
> a bit of research on Windows. Things are looking pretty promising, at least
> since Windows 10, build 1903.

Requiring that Windows version is, unfortunately, not yet acceptable. Many 
corporations, Intel included, have a slow Windows upgrade cycle. According to 
the Wikipedia page[1], MSFT is offering a 2½-year support cycle for each of 
those Windows 10 releases, so we need to support pre-1903 releases until at 
least May 11, 2021. That's past the Qt 6 release date.

There's also Windows 8.1, which is supported until January 2023[2].

And then there are all the unsupported Windows versions that continue to be 
used.

[1] https://en.wikipedia.org/wiki/Windows_10_version_history
[2] https://support.microsoft.com/en-us/help/13853/windows-lifecycle-fact-sheet

> Windows actually has two 8bit code pages that you need to take care of when
> writing an application. There’s the application code page, which can be
> retrieved by the GetACP() method (1). The application code page could
> (until build 1903) only be changed for the system as a whole, but can now
> also be changed on a per application basis using a manifest file (2).
 
> The console has a separate encoding, that can be changed programmatically
> from within the app (3). These code pages can be different from each other,
> making things interesting.

They almost always are, since the console codepage is the DOS codepage. For 
most of the Western world, that's CP 850. The problem here lies not in the 
application itself but in the terminal application. You can see the same issue 
if you go to View > Set Encoding in Konsole and change to something other than 
your current locale's encoding.
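
To make the mismatch concrete, here is a minimal sketch of how one byte reads under two code pages. It assumes the application code page is Windows-1252 (the usual Western European counterpart of CP 850, not stated above), and decodeByte is a hypothetical helper that only knows the handful of bytes needed for the illustration.

```cpp
#include <cassert>

enum class CodePage { Cp850, Cp1252 };

// Hypothetical helper: maps a single byte to a Unicode code point.
// Only ASCII and the bytes used in this illustration are covered; every
// other byte yields U+FFFD. A real converter needs the full tables.
constexpr char32_t decodeByte(CodePage cp, unsigned char byte)
{
    return byte < 0x80 ? byte
         : (cp == CodePage::Cp1252 && byte == 0xE9) ? 0x00E9   // é
         : (cp == CodePage::Cp850  && byte == 0xE9) ? 0x00DA   // Ú
         : (cp == CodePage::Cp850  && byte == 0x82) ? 0x00E9   // é
         : 0xFFFD;
}
```

Under these assumptions, a program that writes "é" as the Windows-1252 byte 0xE9 is shown as "Ú" by a console set to CP 850, even though the program itself did nothing wrong.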

TBH, since Qt is not designed to deal with console applications, especially on 
Windows, I'm going to simply say we should ignore the DOS/console codepage and 
say "not our problem".

The setting of GetACP is the one we need to deal with.
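
As a rough illustration, checking that setting could look like the sketch below. The helper names activeCodePage and eightBitIsUtf8 are hypothetical and not Qt API, and the non-Windows branch simply assumes UTF-8 because there is no ACP to query there.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// 65001 is the Windows code page identifier for UTF-8 (CP_UTF8).
constexpr unsigned kUtf8CodePage = 65001;

// Hypothetical helper: the active application (ANSI) code page.
// Assumption: platforms without an ACP are treated as UTF-8.
inline unsigned activeCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    return kUtf8CodePage;
#endif
}

// Hypothetical helper: true when 8-bit strings can be treated as UTF-8.
constexpr bool eightBitIsUtf8(unsigned codePage)
{
    return codePage == kUtf8CodePage;
}
```

Calling eightBitIsUtf8(activeCodePage()) at startup would tell an application whether the manifest (or a system-wide setting) took effect on the Windows version it is actually running on.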
 
> * SetConsoleOutputCP(CP_UTF8) also works reliably. Writing utf8 based data to
>   stdout, or even using _write(1, data, len) gives the correct output on the
>   console. This even works when writing one char at a time (ie. incomplete
>   utf8 sequences).
> * With our current handling, none of the test strings show up correctly on
>   the console.

I'd say ignore the console codepage. It's been broken since inception and can 
be fixed by a simple command (chcp 65001).
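
For an application that does want to handle this itself, the quoted text mentions SetConsoleOutputCP(CP_UTF8) as the in-process way to change the console output code page. A minimal sketch, assuming a hypothetical RAII helper named ConsoleUtf8Scope (not Qt API) that restores the previous code page when it goes out of scope; on non-Windows platforms it does nothing:

```cpp
#include <cassert>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: switches the console output code page to UTF-8 for
// the lifetime of the object and restores the previous value afterwards.
class ConsoleUtf8Scope
{
public:
    ConsoleUtf8Scope()
    {
#ifdef _WIN32
        m_previous = GetConsoleOutputCP();
        m_changed = SetConsoleOutputCP(CP_UTF8) != 0;
#endif
    }

    ~ConsoleUtf8Scope()
    {
#ifdef _WIN32
        if (m_changed)
            SetConsoleOutputCP(m_previous);
#endif
    }

    ConsoleUtf8Scope(const ConsoleUtf8Scope &) = delete;
    ConsoleUtf8Scope &operator=(const ConsoleUtf8Scope &) = delete;

    // True if the code page was actually switched (always false off Windows).
    bool changed() const { return m_changed; }
    // Code page that was active before the switch (0 off Windows).
    unsigned previousCodePage() const { return m_previous; }

private:
    unsigned m_previous = 0;
    bool m_changed = false;
};
```

Creating one instance at the top of main() would cover the whole run. Restoring the old value matters because the console code page belongs to the terminal, not to the process.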

> * Setting the applications code page to utf8 using a manifest works nicely,
>   and GetACP() returns 65001 (UTF8) instead of 1251.
> * setting the manifest in addition will also make qDebug() work correctly.
> * QTextStream still delivers mojibake in all cases. I assume there’s a bug
>   somewhere in the way we handle things in QTextStream, this needs some
>   debugging.

Let's figure out if QTextStream has a bug after your changes to 
QStringEncoder/Decoder land.

> As the code page for the console is not compatible with the ansi code page,
> I don't see why we shouldn't change the console code pages in any case. In
> addition, I think we should add the manifest file to the app through the
> build system by default (and offer a switch to turn both the console code
> page and manifest handling off).

The problem with the manifest solution is that the application now behaves 
differently depending on which Windows version it's being run on. This means 
application developers running it on their own machines may not see the issue 
their users see.

I don't think turning this on by default for our users is a good option. I 
would advise our users to do it (somehow) and we should do that for our own 
tools and tests.

> I think this would be mostly a positive change and won't break too many
> things for our users. Most Qt apps don't make heavy use of the 8bit APIs.
> If they do, they need to be prepared to handle different code pages anyway,
> so changing to utf8 should not break anything for them. Filenames are
> encoded in utf16 on NTFS, so setting the ACP to utf8 would make all files
> accessible by the 8bit APIs. 

The problem is not the application's cross-platform code that uses Qt. The 
problem is where they directly use the 8-bit API, especially the Windows 8-bit 
API ("A" functions). Those applications may have a lot of legacy code they've 
been carrying for decades. They may be using third-party Windows-only 
libraries that haven't been updated to deal with multibyte encodings and 
simply can't be updated. Since this class of errors has a good chance of only 
showing up on a user's machine and not the developer's, I remain skeptical 
about making it the default.
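
To illustrate the kind of breakage meant here, consider legacy code that assumes one byte per character. The sketch below is an invented example, not taken from any real application: truncateBytes stands in for such legacy code and truncateUtf8 for a multibyte-aware replacement, and both names are hypothetical. It assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Legacy-style truncation: cuts at a byte offset, which is only safe when
// every character is exactly one byte (as in code pages 1251 or 1252).
std::string truncateBytes(const std::string &s, std::size_t maxBytes)
{
    return s.substr(0, maxBytes);
}

// Multibyte-aware truncation: backs up to the start of a code point so the
// result never ends in a partial UTF-8 sequence.
std::string truncateUtf8(const std::string &s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes)
        return s;
    std::size_t end = maxBytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;
    return s.substr(0, end);
}
```

With the string "café" encoded as UTF-8 (five bytes), truncateBytes at four bytes leaves a dangling lead byte 0xC3, while truncateUtf8 returns "caf". Under a single-byte code page the same legacy code is correct, which is why such a bug would surface only on machines where the ACP has become UTF-8.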

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




