[Development] Why can't QString use UTF-8 internally?
Thiago Macieira
thiago.macieira at intel.com
Wed Feb 11 17:29:13 CET 2015
On Wednesday 11 February 2015 11:22:59 Julien Blanc wrote:
> On 11/02/2015 10:32, Bo Thorsen wrote:
> > 2) length() returns the number of chars I see on the screen, not a
> > random implementation detail of the chosen encoding.
>
> How’s that supposed to work with combining characters, which are part of
> unicode ?
That's true. And add that there are some zero-width characters too and some
characters that are double-width.
Also, QString::length() returns the number of UTF-16 code units, not the
number of Unicode code points, so it reports 2 for a surrogate pair,
not 1.
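To make the difference concrete, here is a minimal sketch in plain C++ (no Qt), assuming std::u16string as a stand-in for QString's UTF-16 storage. The helper name countCodePoints is hypothetical, not a Qt or standard library function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point;
// an unpaired surrogate is counted as one unit of its own.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size()) {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

For U+1F600, size() is 2 while countCodePoints() gives 1. For "n" followed by U+0303 both give 2, which is why counting code points still does not answer "how many characters do I see".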
If you really want to know the width of the string as seen on screen, you need
to use QFontMetrics, even for a monospace setting.
> > 3) at(int) and [] gives the unicode char, not a random encoding char.
>
> Same problem with combining characters. What do you expect :
>
> QString s = QString::fromWCharArray(L"n\u0303");
> s.length(); // 1 or 2 ??
> s[0]; // n or ñ ??
Yet, unlike std::u16string, QString can convert from NFD to NFC:
    QString s = QString::fromUtf16(u"n\u0303")
                    .normalized(QString::NormalizationForm_C);
    s.length() == 1;
    s[0] == QChar(0x00F1);  // ñ
> > Another note: Latin1 is the worst idea for i18n ever invented, and it's
> > by now useless, irrelevant and only a source for bugs once you start to
> > truly support i18n outside of USA and Western Europe. I would be one
> > step closer to total happiness if C++17 and Qt7 makes this "encoding"
> > completely unsupported.
>
> Could not agree more with that part.
There are two reasons we keep Latin1 in the API:
1) it's a superset of US-ASCII, so toAscii and fromAscii are just calls to
the Latin1 functions with the note "behaviour is undefined if the string
contains non-ASCII characters"
2) it's dead easy to convert between it and UTF-16
As I was explaining yesterday to some people, the core of the loop of
converting from Latin1 to UTF-16 is *two* AVX2 instructions:
b36: vpmovzxbw (%rax,%rsi,1),%ymm0
b3c: vmovdqu %ymm0,(%rdi,%rax,2)
[plus the loop overhead itself]
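In scalar terms, those two instructions zero-extend each byte to 16 bits and store the result. A minimal sketch in plain C++, assuming std::string holds the Latin1 bytes; latin1ToUtf16 is a hypothetical name and this is not the actual Qt code:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: every Latin1 byte value equals its Unicode code
// point, so the conversion is a zero-extension from 8 to 16 bits.
std::u16string latin1ToUtf16(const std::string &latin1)
{
    std::u16string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    }
    return out;
}
```

The vector version does the same thing for a whole register of bytes at once.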
The conversion from UTF-16 to Latin1 is a little more complex due to the
requirement to replace non-Latin1 characters with '?', so it's a few more
instructions with AVX-512F:
1c60a: vpmovzxwd 0x0(%r13,%rdx,2),%zmm2
1c612: vpcmpnltud %zmm1,%zmm2,%k1
1c61d: vpblendmd %zmm0,%zmm2,%zmm3{%k1}
1c623: vpmovdb %zmm3,(%rdx,%rdi,1)
Without AVX-512F (which no one has yet), it expands to more code.
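The scalar equivalent of that compare-and-blend is short. Again a sketch in plain C++ under the same assumptions; utf16ToLatin1 is a hypothetical name, not the Qt implementation:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: narrows UTF-16 code units to Latin1 bytes.
// Any code unit above 0xFF has no Latin1 equivalent and becomes '?'.
// Each half of a surrogate pair is replaced separately, so one
// non-BMP character turns into two '?'.
std::string utf16ToLatin1(const std::u16string &utf16)
{
    std::string out;
    out.reserve(utf16.size());
    for (char16_t c : utf16) {
        if (c > 0xFF)
            out.push_back('?');
        else
            out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

The branch in the loop body is what the mask compare and blend replace in the vector version.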
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center