[Interest] Using UTF-8 code page with Qt5 on Windows?

Thu May 19 13:28:15 CEST 2022

Hi,

> > This gave me an idea: perhaps the easiest way for Qt to fix it would
> > be to check `GetACP() == CP_UTF8`, and if it is true then just use
> > Qt's built-in UTF-8 support and bypass MultiByteToWideChar completely.
>
> Indeed. And given that the UTF-8 codec is highly optimised, it will
> definitely be much faster. I'll make the change for Qt 6.

I will patch our copy of Qt to do the same.
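
For illustration only, a minimal sketch of that check. This is not the
actual Qt change: `DecodePath`, `chooseDecodePath` and `kCpUtf8` are made-up
names. On Windows the argument would be the result of `GetACP()`, and the
constant has the same value as `CP_UTF8` (65001). The code page is passed in
as a parameter so the logic can be tried out on any platform.

```cpp
// Hypothetical names for illustration; none of this is taken from Qt.
enum class DecodePath {
    BuiltinUtf8,      // decode with the application's own UTF-8 codec
    SystemConversion  // fall back to MultiByteToWideChar
};

// Same value as CP_UTF8 in the Windows headers.
const unsigned kCpUtf8 = 65001;

// activeCodePage is what GetACP() would return on Windows.
DecodePath chooseDecodePath(unsigned activeCodePage)
{
    return activeCodePage == kCpUtf8 ? DecodePath::BuiltinUtf8
                                     : DecodePath::SystemConversion;
}
```

With this, `chooseDecodePath(65001)` picks the built-in codec, while a DBCS
code page such as 932 still goes through the system conversion.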

> However, it wouldn't solve the problem for other multibyte locales that may
> have more than one continuation character. A quick check of the likely
> culprits reveals that:
>  * Chinese (CP 936) uses GBK, which is limited to two bytes
>  * Japanese (CP 932) uses a variant of Shift JIS, but is also two-byte only
>  * Korean (CP 949) uses the Unified Hangul Code, which likewise only goes up
>    to two bytes
>
> Wikipedia also says that GB 2312 is the most common encoding for web pages in
> Chinese, but that is a one- or two-byte codec too, and it is no longer
> used by Windows itself.
>
> So it looks like we've never hit this problem because the codepages used by
> Windows were all DBCS. It might not be worth fixing the codec implementation
> then.

Agreed. I have a suspicion that Microsoft has in the past stated
explicitly in its docs that MBCS never uses more than 2 bytes per
character, and that bugs of a similar nature are why the UTF-8 support
is opt-in and why the system-wide support is marked as beta...
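
To make the difference from the DBCS code pages concrete, here is a rough
sketch of how many trailing bytes a chunked decoder may have to hold back
when a UTF-8 character is split across two reads. With DBCS that is at most
one byte; with UTF-8 it can be up to three. The helper names
`utf8SequenceLength` and `incompleteTailLength` are hypothetical and come
from neither Qt nor the Windows API, and the code does not validate the
input beyond what it needs.

```cpp
#include <cassert>
#include <cstddef>

// Total length of a UTF-8 sequence, judged from its lead byte.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Number of bytes at the end of the buffer that start a sequence whose
// remaining bytes have not arrived yet (0 if the buffer ends cleanly).
std::size_t incompleteTailLength(const unsigned char *data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    // At most 3 bytes can be pending: a 4-byte sequence missing its last byte.
    for (std::size_t back = 1; back <= 3 && back <= size; ++back) {
        const unsigned char b = data[size - back];
        if ((b & 0xC0) == 0x80)
            continue;  // continuation byte, keep looking for the lead byte
        // Lead byte found: incomplete only if it announces more bytes
        // than are present from here to the end of the buffer.
        return utf8SequenceLength(b) > back ? back : 0;
    }
    return 0;
}
```

For example, a buffer ending in `E2 82` (the first two bytes of the euro
sign) has two pending bytes, and one ending in `F0 9F 98` has three, which
is more than a decoder written with DBCS in mind would expect to carry over.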

Cheers,
Alvin Wong