[Interest] Using UTF-8 code page with Qt5 on Windows?

Thu May 19 06:58:22 CEST 2022

On Wednesday, 18 May 2022 20:31:55 PDT Alvin Wong wrote:
> Thanks for pointing me to the test. I compiled the test and created an
> application manifest for it to use UTF-8. It correctly detected the
> system codec to be UTF-8 and reported several failures and a bunch of
> warnings. I've attached the test output.

Thanks. It looks like you were completely right in your inspection of the 
code. All of the charByChar test rows that used more than 2 bytes per 
character failed.

The "nul" entry also did, but that looks like a shortcoming of the Win32 API.

> This gave me an idea: perhaps the easiest way for Qt to fix it would
> be to check `GetACP() == CP_UTF8`, and if it is true then just use
> Qt's built-in UTF-8 support and bypass MultiByteToWideChar completely.

Indeed. And given that the UTF-8 codec is highly optimised, it will 
definitely be much faster. I'll make the change for Qt 6.
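
To make the idea concrete, here is a minimal sketch of that dispatch. It is 
not the actual Qt change: `choosePath`, `DecodePath`, `kCpUtf8` and 
`activeCodePage` are names made up for illustration, and the non-Windows 
branch is only a stand-in so the snippet compiles elsewhere. The only facts it 
relies on are that `GetACP()` returns the active code page and that `CP_UTF8` 
is 65001.

```cpp
#include <cassert>

#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical names throughout; this is an illustration, not Qt code.
constexpr unsigned kCpUtf8 = 65001; // numeric value of CP_UTF8 in the Windows SDK

enum class DecodePath {
    BuiltinUtf8, // use the toolkit's own UTF-8 decoder
    Win32Api     // fall back to MultiByteToWideChar
};

// Decide which decoder to use for local 8-bit text, given the active code page.
constexpr DecodePath choosePath(unsigned codePage)
{
    return codePage == kCpUtf8 ? DecodePath::BuiltinUtf8 : DecodePath::Win32Api;
}

// Query the active code page. On Windows this is GetACP(); elsewhere we
// pretend the answer is UTF-8, purely so the sketch builds and runs.
inline unsigned activeCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    return kCpUtf8;
#endif
}
```

With this, `choosePath(activeCodePage())` selects the built-in decoder when 
the process runs with a UTF-8 manifest, and the Win32 path for something like 
CP 936.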

However, it wouldn't solve the problem for other multibyte locales that may 
have more than one continuation byte. A quick check of the likely culprits 
reveals that:
 * Chinese (CP 936) uses GBK, which is limited to two bytes
 * Japanese (CP 932) uses a variant of Shift JIS, but is also two-byte only
 * Korean (CP 949) uses the Unified Hangul Code, which likewise only goes up 
   to two bytes

Wikipedia also says that GB 2312 is the most common encoding for web pages in 
Chinese, but that is a one- or two-byte codec too. And it is no longer used by 
Windows itself.

So it looks like we've never hit this problem because the codepages used by 
Windows were all DBCS. It might not be worth fixing the codec implementation 
then.
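
A toy model of why DBCS code pages would never trip this, assuming the 
decoder state can remember only one pending byte between calls (that is a 
simplification for illustration, not a description of the real codec). 
`utf8SequenceLength` and `feedByteByByte` are hypothetical helpers, and UTF-8 
is used as the example of an encoding with longer sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence judged from its first byte;
// 0 if the byte cannot start a sequence.
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Feed the input one byte at a time into a toy decoder that can keep at most
// maxPending bytes of an unfinished sequence. Returns true if every sequence
// could be completed without overflowing that state.
inline bool feedByteByByte(const std::string &input, std::size_t maxPending)
{
    std::size_t pending = 0;  // bytes of the current sequence held so far
    std::size_t expected = 0; // total length of the current sequence
    for (unsigned char byte : input) {
        if (pending == 0) {
            expected = utf8SequenceLength(byte);
            if (expected == 0) return false;   // invalid lead byte
            if (expected == 1) continue;       // ASCII completes immediately
            if (maxPending < 1) return false;  // nowhere to keep the lead byte
            pending = 1;
        } else {
            if ((byte & 0xC0) != 0x80) return false;    // not a continuation byte
            if (pending + 1 == expected) {              // sequence is complete
                pending = 0;
                continue;
            }
            if (pending + 1 > maxPending) return false; // state too small
            ++pending;
        }
    }
    return pending == 0; // false if the input ended mid-sequence
}
```

In this model a one-byte state is enough for any two-byte sequence, which is 
the DBCS case, but a three- or four-byte UTF-8 sequence needs two or three 
bytes of state to survive being fed one byte at a time.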

> > As far as I know, it already does. The Vietnamese locale for Windows has
> > been using UTF-8 for years (probably since forever) and there's no reason
> > that Qt shouldn't support it.
> 
> I don't have first-hand experience with Vietnamese Windows but isn't
> Windows-1258 the Vietnamese code page? I know vaguely that
> Unicode-only locales (not Vietnamese) are a thing on Windows, but I
> thought they had no way to use UTF-8 as the ACP until the beta UTF-8
> support landed on Windows 10.

Ok, I'm probably remembering wrong. I thought it had been possible all along, 
but you had to switch to a particular language (which I thought was 
Vietnamese), which most users were not in a position to do.

The Wikipedia article on CP 1258 has the sentence "UTF-8 is the preferred 
encoding for Vietnamese in modern applications." I guess I was misled by it 
and thought it meant UTF-8 was in use on Windows.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering