[Development] Utf8 as the 8bit encoding on Windows

Lars Knoll lars.knoll at qt.io
Tue May 12 10:07:08 CEST 2020


> On 11 May 2020, at 17:08, Thiago Macieira <thiago.macieira at intel.com> wrote:
> 
> On Monday, 4 May 2020 06:46:44 PDT Lars Knoll wrote:
>> Hi all,
> 
> Hello Lars
> 
>> I would really like to make this 100% cross-platform, so I sat down and did
>> a bit of research on Windows. Things are looking pretty promising, at least
>> since Windows 10, build 1903.
> 
> Requiring that Windows version is, unfortunately, not yet acceptable. Many 
> corporations, Intel included, have a slow Windows upgrade cycle. According to 
> the Wikipedia page[1], MSFT is offering a 2½-year support cycle for each of 
> those Windows 10 releases, so we need to support pre-1903 releases until at 
> least May 11, 2021. That's past the Qt 6 release date.

Of course we can’t require that Windows version; we still need to be able to handle older versions. We are, however, limiting ourselves to Windows 10 in Qt 6. And May 2021 is not far beyond the 6.0 release time frame, so we should plan towards making use of those things by default.
> 
> There's also Windows 8.1, which is supported until January 2023[2]

Which we said we’re not officially supporting in Qt 6.
> 
> And then there's all the unsupported Windows versions that continue to be 
> used.
> 
> [1] https://en.wikipedia.org/wiki/Windows_10_version_history
> [2] https://support.microsoft.com/en-us/help/13853/windows-lifecycle-fact-sheet
> 
>> Windows actually has two 8bit code pages that you need to take care of when
>> writing an application. There’s the application code page (which can be
>> retrieved by the GetACP() method (1)). The application code page could
>> (until build 1903) only be changed for the system as a whole, but can now
>> also be changed on a per application basis using a manifest file (2). 
> 
>> The console has a separate encoding, that can be changed programmatically
>> from within the app (3). These code pages can be different from each other,
>> making things interesting.
> 
> They almost always are, since the console codepage is the DOS codepage. For 
> most of the Western world, that's CP 850. The problem here is not the 
> application, but the terminal application. You can see the same issue if you 
> go to View > Set Encoding in Konsole and change to something other than your 
> current locale's encoding.
> 
> TBH, since Qt is not designed to deal with console applications, especially on 
> Windows, I'm going to simply say we should ignore the DOS/console codepage and 
> say "not our problem".

Qt is being used for terminal apps on Unix systems. We use it for our own tools as well. What I’m saying is that we can fix things for the Windows terminal without too much trouble on our side. If we can, why shouldn’t we?
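
To make the console part concrete, here is a minimal sketch of what such a fix could look like. It assumes nothing beyond the SetConsoleOutputCP(CP_UTF8) call discussed in this thread; the helper names (enableUtf8ConsoleOutput, utf8Sample, writeUtf8Sample) are made up for illustration and are not Qt API, and the non-Windows branch simply leaves the terminal alone.

```cpp
#include <cstdio>
#include <cstring>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper (not a Qt API): ask the Windows console to interpret
// our output as UTF-8. Returns true when the switch succeeded.
bool enableUtf8ConsoleOutput()
{
#ifdef _WIN32
    // SetConsoleOutputCP() returns nonzero on success.
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    // Assumption: on other platforms the terminal encoding is left alone.
    return true;
#endif
}

// Hypothetical test string: "ä€" plus a newline, spelled as raw UTF-8 bytes
// so the result does not depend on the source file encoding.
const char *utf8Sample()
{
    return "\xC3\xA4\xE2\x82\xAC\n";
}

// Writes the sample to stdout and returns the number of bytes written.
std::size_t writeUtf8Sample()
{
    const char *s = utf8Sample();
    return std::fwrite(s, 1, std::strlen(s), stdout);
}
```

A tool would call enableUtf8ConsoleOutput() once at startup and then write UTF-8 as usual. Whether the previous console code page should be restored on exit is a design question this sketch leaves open.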
> 
> The setting of GetACP is the one we need to deal with.

This is certainly the more important one. Unfortunately, it’s also harder to deal with.
> 
>> * SetConsoleOutputCP(CP_UTF8) also works reliably. Writing utf8 based data to
>>   stdout, or even using _write(1, data, len) gives the correct output on the
>>   console. This even works when writing one char at a time (ie. incomplete
>>   utf8 sequences).
>> * With our current handling, none of the test strings show up correctly on
>>   the console.
> 
> I'd say ignore the console codepage. It's been broken since inception and can 
> be fixed by a simple command-line (chcp 65001).
> 
>> * Setting the application's code page to utf8 using a manifest works nicely,
>>   and GetACP() returns 65001 (UTF8) instead of 1251.
>> * Setting the manifest in addition will also make qDebug() work correctly.
>> * QTextStream still delivers mojibake in all cases. I assume there’s a bug
>>   somewhere in the way we handle things in QTextStream, this needs some
>>   debugging.
> 
> Let's figure out if QTextStream has a bug after your changes to 
> QStringEncoder/Decoder land.

I did my testing with all those changes applied, but I’m happy to wait with this until they are in.
> 
>> As the code page for the console is not compatible with the ansi code page,
>> I don't see why we shouldn't change the console code pages in any case. In
>> addition, I think we should add the manifest file to the app through the
>> build system by default (and offer a switch to turn both the console code
>> page and manifest handling off).
> 
> The problem with the manifest solution is that the application now behaves 
> differently depending on which Windows version it's being run on. This means 
> application developers may not see the issue their users see when they run on 
> their own machines.

Without the manifest, the app behaves differently depending on which locale the Windows machine has. This means the developer won’t see, on his Western developer machine, an issue that can happen on a Japanese machine. In that respect, I don’t see a fundamental difference between using the manifest and not using it. Using it will fix things in the longer term, though.
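
One way to make that difference visible during development is a startup check of the active code page. The following is only a sketch: isUtf8CodePage, activeCodePage and manifestTookEffect are hypothetical names, not Qt API, and the non-Windows branch simply assumes UTF-8 so that the snippet compiles everywhere. GetACP() and the value 65001 are the ones mentioned earlier in the thread.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// 65001 is the Windows code page identifier for UTF-8 (CP_UTF8).
constexpr unsigned kUtf8CodePage = 65001;

// Pure helper: does this code page identifier denote UTF-8?
bool isUtf8CodePage(unsigned codePage)
{
    return codePage == kUtf8CodePage;
}

// Hypothetical wrapper: the application (ANSI) code page of this process.
unsigned activeCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    // Assumption for this sketch only: treat other platforms as UTF-8.
    return kUtf8CodePage;
#endif
}

// True when the 8-bit application code page is UTF-8, for example because
// a UTF-8 manifest was honoured by the Windows version we are running on.
bool manifestTookEffect()
{
    return isUtf8CodePage(activeCodePage());
}
```

An application could log a warning when manifestTookEffect() returns false, which would at least flag the older-Windows case on the developer's side.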
> 
> I don't think turning this on by default for our users is a good option. I 
> would advise our users to do it (somehow) and we should do that for our own 
> tools and tests.

I am not sure it would be a big problem to turn it on. But at the very minimum, we should facilitate this (add an easy way to use it in the build system) and use it ourselves.
> 
>> I think this would be mostly a positive change and won't break too many
>> things for our users. Most Qt apps don't make heavy use of the 8bit APIs.
>> If they do, they need to be prepared to handle different code pages anyway,
>> so changing to utf8 should not break anything for them. Filenames are
>> encoded in utf16 on NTFS, so setting the ACP to utf8 would make all files
>> accessible by the 8bit APIs. 
> 
> The problem is not the application's cross-platform code that uses Qt. The 
> problem is where they directly use the 8-bit API, especially the Windows 8-bit 
> API ("A" functions). Those applications may have a lot of legacy code they've 
> been carrying for decades. They may be using third-party Windows-only 
> libraries that haven't been updated to deal with multibyte encodings and 
> simply can't be updated. Since this class of errors has a good chance of only 
> showing up on a user's machine and not the developer's, I remain skeptical 
> about making it the default.

Those old libraries would fail on a Japanese Windows installation as well, wouldn’t they?
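
To illustrate the kind of failure I mean, here is a small sketch of code that assumes one byte per character, next to a variant that respects multibyte sequences. It uses UTF-8 as the example encoding and assumes valid input; all function names are made up for illustration and do not come from any real library.

```cpp
#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8CodePoints(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Legacy-style truncation: assumes one byte per character and may cut a
// multibyte sequence in half.
std::string truncateBytes(const std::string &s, std::size_t maxBytes)
{
    return s.substr(0, maxBytes);
}

// Multibyte-aware truncation: backs up to the previous code point boundary
// so that no sequence is split.
std::string truncateUtf8(const std::string &s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes)
        return s;
    std::size_t end = maxBytes;
    // s[end] is the first byte that would be dropped; if it is a
    // continuation byte, we are in the middle of a sequence.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;
    return s.substr(0, end);
}
```

With the five-byte string "ä€" and a three-byte limit, truncateBytes() keeps a dangling lead byte while truncateUtf8() returns just "ä". The same class of bug exists with any multibyte code page, not only UTF-8.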

Cheers,
Lars


