[Development] Why can't QString use UTF-8 internally?

Wed Feb 11 17:29:13 CET 2015

On Wednesday 11 February 2015 11:22:59 Julien Blanc wrote:
> On 11/02/2015 10:32, Bo Thorsen wrote:
> > 2) length() returns the number of chars I see on the screen, not a
> > random implementation detail of the chosen encoding.
> 
> How’s that supposed to work with combining characters, which are part of
> unicode ?

That's true. On top of that, some characters are zero-width and some are 
double-width.

Also, QString::length() returns the number of UTF-16 code units, not the 
number of Unicode code points (UCS-4 characters), so it reports 2 for a 
surrogate pair, not 1.
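
To make the difference concrete, here is a minimal sketch in plain C++ (no Qt). 
The helper name countCodePoints is made up for illustration and it assumes 
well-formed UTF-16; it is not how QString does anything internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Qt: counts Unicode code points in a
// well-formed UTF-16 string by skipping the low (trailing) half of each
// surrogate pair.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF)   // not a low surrogate
            ++n;
    }
    return n;
}
```

For U+1D11E (written u"\U0001D11E"), size() is 2 code units while the helper 
counts 1 code point. For u"n\u0303" both counts are 2, which is the combining 
character problem again: neither number is "what I see on the screen".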

If you really want to know the width of the string as seen on screen, you need 
to use QFontMetrics, even with a monospace font.

> > 3) at(int) and [] gives the unicode char, not a random encoding char.
> 
> Same problem with combining characters. What do you expect :
> 
> QString s = QString::fromWCharArray(L"n\u0303");
> s.length(); // 1 or 2 ??
> s[0]; // n or ñ ??

Yet, unlike std::u16string, QString can convert from NFD to NFC:

QString s = QString::fromUtf16(u"n\u0303")
	.normalized(QString::NormalizationForm_C);
s.length() == 1;         // true: one precomposed code unit
s[0] == QChar(0x00F1);   // true: U+00F1, ñ

> > Another note: Latin1 is the worst idea for i18n ever invented, and it's
> > by now useless, irrelevant and only a source for bugs once you start to
> > truly support i18n outside of USA and Western Europe. I would be one
> > step closer to total happiness if C++17 and Qt7 makes this "encoding"
> > completely unsupported.
> 
> Could not agree more with that part.

There are two reasons we keep Latin1 in the API:
 1) it's a superset of US-ASCII, so toAscii and fromAscii are just calls to 
the Latin1 functions with the note "behaviour is undefined if the string 
contains non-ASCII characters"

 2) it's dead easy to convert between it and UTF-16

As I was explaining yesterday to some people, the core of the loop of 
converting from Latin1 to UTF-16 is *two* AVX2 instructions:

     b36:       vpmovzxbw (%rax,%rsi,1),%ymm0
     b3c:       vmovdqu %ymm0,(%rdi,%rax,2)
	[plus the loop overhead itself]
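
In scalar terms, what those two instructions do for 16 characters at a time is 
roughly the following. This is a sketch in plain C++ with a made-up function 
name (latin1ToUtf16), not the actual Qt source:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not Qt's implementation: every Latin1 byte maps to
// the UTF-16 code unit with the same numeric value, so the conversion is a
// plain zero-extension from 8 to 16 bits (what vpmovzxbw does per lane).
std::u16string latin1ToUtf16(const std::string &latin1)
{
    std::u16string out;
    out.reserve(latin1.size());
    for (char c : latin1) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

There is no table lookup and no branch on the data, which is why it 
vectorises so well.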

The conversion from UTF-16 to Latin1 is a little more complex due to the 
requirement to replace non-Latin1 characters with '?', so it's a few more 
instructions with AVX-512F:

   1c60a:       vpmovzxwd 0x0(%r13,%rdx,2),%zmm2
   1c612:       vpcmpnltud %zmm1,%zmm2,%k1
   1c61d:       vpblendmd %zmm0,%zmm2,%zmm3{%k1}
   1c623:       vpmovdb %zmm3,(%rdx,%rdi,1)

Without AVX-512F (which no one has yet), it expands to more code.
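
For comparison, the scalar equivalent of that compare-and-blend looks roughly 
like this. Again a sketch in plain C++ with a made-up name (utf16ToLatin1), 
not the Qt code itself:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not Qt's implementation: code units above 0xFF have
// no Latin1 equivalent and are replaced with '?'; everything else is
// truncated to 8 bits.
std::string utf16ToLatin1(const std::u16string &utf16)
{
    std::string out;
    out.reserve(utf16.size());
    for (char16_t c : utf16) {
        if (c > 0xFF)
            out.push_back('?');
        else
            out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Note that the replacement works per code unit, so in this sketch a surrogate 
pair comes out as two '?' characters.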

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center