[Interest] Using UTF-8 code page with Qt5 on Windows?
Thiago Macieira
thiago.macieira at intel.com
Thu May 19 06:58:22 CEST 2022
On Wednesday, 18 May 2022 20:31:55 PDT Alvin Wong wrote:
> Thanks for pointing me to the test. I compiled the test and created an
> application manifest for it to use UTF-8. It correctly detected the
> system codec to be UTF-8 and reported several failures and a bunch of
> warnings. I've attached the test output.
Thanks. It looks like you were completely right in your inspection of the
code. All of the charByChar test rows that used more than 2 characters failed.
The "nul" entry also did, but that looks like a shortcoming of the Win32 API.
> This gave me an idea: perhaps the easiest way for Qt to fix it would
> be to check `GetACP() == CP_UTF8`, and if it is true then just use
> Qt's built-in UTF-8 support and bypass MultiByteToWideChar completely.
Indeed. And given that the UTF-8 codec is highly optimised, it will
definitely be much faster. I'll make the change for Qt 6.
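A minimal sketch of the check Alvin describes, not the actual Qt change. The
names LocalDecoder, chooseLocalDecoder and activeCodePage are hypothetical and
exist only for illustration; the only real API used is GetACP(), and 65001 is
the value of CP_UTF8 in the Windows SDK. The code page is passed as a parameter
so the decision can be exercised on any platform:

```cpp
#include <cstdint>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical names for this sketch; none of them are Qt API.
enum class LocalDecoder {
    BuiltinUtf8, // use the library's own UTF-8 codec
    Win32Api     // fall back to MultiByteToWideChar
};

// Numeric value of CP_UTF8 as defined by the Windows SDK.
constexpr std::uint32_t kCodePageUtf8 = 65001;

// Decide which decoder to use for a given ANSI code page.
LocalDecoder chooseLocalDecoder(std::uint32_t codePage)
{
    return codePage == kCodePageUtf8 ? LocalDecoder::BuiltinUtf8
                                     : LocalDecoder::Win32Api;
}

// Query the active code page. On Windows this is GetACP(); elsewhere the
// sketch simply assumes UTF-8, which is an assumption and not a fact
// about any particular system.
std::uint32_t activeCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    return kCodePageUtf8;
#endif
}
```

A caller would evaluate chooseLocalDecoder(activeCodePage()) once and only go
through MultiByteToWideChar when the result is LocalDecoder::Win32Api.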
However, it wouldn't solve the problem for other multibyte locales that may
have more than one continuation character. A quick check of the likely
culprits reveals that:
* Chinese (CP 936) uses GBK, which is limited to two bytes
* Japanese (CP 932) uses a variant of Shift JIS, but is also two-byte only
* Korean (CP 949) uses the Unified Hangul Code, which likewise only goes up
to two bytes
Wikipedia also says that GB 2312 is the most common encoding for web pages in
Chinese, but that is a one- or two-byte codec too. And it is no longer
used by Windows itself.
So it looks like we've never hit this problem because the codepages used by
Windows were all DBCS. It might not be worth fixing the codec implementation
then.
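To illustrate why two-byte code pages would not trigger the failure while
UTF-8 does, here is a small byte-at-a-time decoder. It is a simplified sketch
and not Qt's codec: it assumes the failure comes from conversion state that
can remember only one pending byte, which is my reading of the discussion
above rather than something the test output states. The class and function
names are hypothetical. It does no overlong or surrogate checking, and a bad
byte in the middle of a sequence is simply dropped.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical streaming decoder, for illustration only.
class StreamingUtf8Decoder
{
public:
    // Feed one byte. Returns a code point when a sequence completes,
    // U+FFFD on malformed input, or nothing while more bytes are needed.
    std::optional<char32_t> feed(unsigned char byte)
    {
        constexpr char32_t replacement = 0xFFFD;
        if (remaining_ == 0) {
            if (byte < 0x80)
                return static_cast<char32_t>(byte);   // ASCII, no state needed
            if ((byte & 0xE0) == 0xC0) {              // lead of a 2-byte sequence
                value_ = byte & 0x1F;
                remaining_ = 1;
            } else if ((byte & 0xF0) == 0xE0) {       // lead of a 3-byte sequence
                value_ = byte & 0x0F;
                remaining_ = 2;
            } else if ((byte & 0xF8) == 0xF0) {       // lead of a 4-byte sequence
                value_ = byte & 0x07;
                remaining_ = 3;
            } else {
                return replacement;                   // stray continuation or invalid lead
            }
            return std::nullopt;
        }
        if ((byte & 0xC0) != 0x80) {                  // expected a continuation byte
            remaining_ = 0;
            return replacement;
        }
        value_ = (value_ << 6) | (byte & 0x3F);
        if (--remaining_ == 0)
            return value_;
        return std::nullopt;
    }

    // Number of continuation bytes still expected. A DBCS code page never
    // needs more than 1 here; UTF-8 can need up to 3.
    int remaining() const { return remaining_; }

private:
    char32_t value_ = 0;
    int remaining_ = 0;
};

// Decode a whole buffer by feeding it one byte at a time.
std::u32string decodeByteByByte(const std::string &bytes)
{
    StreamingUtf8Decoder decoder;
    std::u32string out;
    for (unsigned char b : bytes) {
        if (auto cp = decoder.feed(b))
            out.push_back(*cp);
    }
    return out;
}
```

Feeding the three bytes of the euro sign (E2 82 AC) leaves two continuation
bytes outstanding after the first call, so the decoder has to carry state
across two further calls before it can emit U+20AC. A two-byte code page such
as CP 932 never has more than one byte outstanding.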
> > As far as I know, it already does. The Vietnamese locale for Windows has
> > been using UTF-8 for years (probably since forever) and there's no reason
> > that Qt shouldn't support it.
>
> I don't have first-hand experience with Vietnamese Windows but isn't
> Windows-1258 the Vietnamese code page? I know vaguely that
> Unicode-only locales (not Vietnamese) are a thing on Windows, but I
> thought they had no way to use UTF-8 as the ACP until the beta UTF-8
> support landed on Windows 10.
Ok, I'm probably remembering wrong. I thought it had been possible all along,
but you had to switch to a particular language (which I thought was
Vietnamese), which most users were not in a position to do.
The Wikipedia article on CP 1258 has the sentence "UTF-8 is the preferred
encoding for Vietnamese in modern applications." I guess I was misled by it
and thought it meant UTF-8 was in use on Windows.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Cloud Software Architect - Intel DCAI Cloud Engineering