[Development] Oslo, we have a problem</apollo 13> [char8_t]

Mon Jul 8 15:34:02 CEST 2019

On Monday, 8 July 2019 04:24:28 -03 Mutz, Marc via Development wrote:
> What I think when I read this is:
> 
> Backed by const char*, never implicit:
> - QLatin1String - owner of L1 data [change from today, but not a
> breaking one]
> - QLatin1StringView - what QLatin1String is now [requires porting, but
> it's just s/QLatin1String/QLatin1StringView/g in client code]
> 
> Backed by const char8_t*, implicit:
> - QUtf8String - owner of UTF-8 data
> - QUtf8StringView - view over UTF-8 data
> 
> Backed by const char16_t*, implicit (from char16_t*, Q*StringView, NOT
> from QByteArray)
> - QString - owner of UTF-16 data  [as before, possibly using char16_t
> internally to avoid the tons of ushort casts]
> - QStringView - view over UTF-16 data
> 
> Backed by const std::byte*, implicit:
> - QByteArray - owner of std::byte data, no string-like functions
> [breaking change, but anyway far in the future, as we can't depend on
> std::byte, yet]
> - QByteArrayView - view over std::byte (uchar, char, ...) data.
> 
> QByteArray, QUtf8String and QLatin1String(new) could use the same
> backend, for zero-copy transformations between them.
> 
> Is this a realistic goal for Qt 7? Last time I proposed
> QUtf8String/View, its usefulness was challenged. I think the advent of
> char8_t in C++20 and std::byte in C++17 changes the game quite a bit,
> though.

In a green field scenario, yes, that would be a realistic goal. 

I am not completely convinced of the benefit of adding an owning UTF-8
string class, though I very much agree with a view over UTF-8 strings. The
reason is not the string class itself (on its own it is definitely useful),
but the fact that it would muddy the waters as to which string classes one
should use in API. We might end up with some API using UTF-8 and some UTF-16.

But the biggest challenge is converting *every* *single* use of QLatin1String
to QLatin1StringView. We can introduce QLatin1StringView as a direct alias
right now, deprecate QLatin1String at some point in late Qt 6, when people
are no longer trying to keep compatibility with Qt < 5.15, and then
reintroduce QLatin1String as the owning class in Qt 7.0.
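
As a rough sketch of the first step of that rename: Latin1String below is a
hypothetical stand-in for the existing non-owning class, and latin1Length is
a made-up client function, neither is Qt's real code. It only shows that a
direct alias makes both spellings the same type, so client code can be ported
one call site at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for today's non-owning Latin-1 class.
// This is NOT Qt's real QLatin1String definition.
class Latin1String
{
public:
    explicit Latin1String(const char *s) : m_data(s), m_size(std::strlen(s)) {}
    const char *data() const { return m_data; }
    std::size_t size() const { return m_size; }

private:
    const char *m_data;   // not owned
    std::size_t m_size;
};

// Step 1: introduce the view name as a direct alias of the existing class.
using Latin1StringView = Latin1String;

// Hypothetical client function, already ported to the new spelling.
// Callers that still use the old name keep compiling unchanged.
inline std::size_t latin1Length(Latin1StringView s)
{
    return s.size();
}
```

The later steps (deprecating the old name, then giving it owning semantics)
cannot be shown with an alias alone, which is why they need a major release.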

I'm not sure we should go through all that trouble for three functions. People
don't want Latin1 case-insensitivity, they want US-ASCII. It just so happens
that Latin1 was easy enough for us to implement in those functions that we
did so.

I propose we make a documented change in behaviour in 6.0 and remove the upper 
half of the case tables of qbytearray.cpp:latin1_uppercased and 
latin1_lowercased. That would make those functions operate fully on US-ASCII 
only, which would in turn make them safe[*] for UTF-8 content too.

[*] where "safe" is defined as case-insensitive for ASCII and case-sensitive
for non-ASCII. Some broken protocols behave like that, such as DNS-SD (used in
Zeroconf), which uses UTF-8 encoding over US-ASCII case-insensitive DNS.
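
To illustrate why dropping the upper half of the tables matters for UTF-8,
here is a minimal sketch. The function names are made up for this example and
the Latin-1 rule is written out by hand as a stand-in; it is not the actual
table from qbytearray.cpp. It assumes the usual Latin-1 layout, where 0xC0 to
0xDE (except 0xD7) are uppercase letters whose lowercase form is 0x20 higher.

```cpp
#include <cassert>
#include <string>

// ASCII-only lowercasing of one byte: bytes >= 0x80 pass through untouched.
inline unsigned char asciiLower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + 0x20) : c;
}

// Latin-1 lowercasing of one byte, for contrast (hand-written stand-in):
// also folds 0xC0..0xDE, except 0xD7 (the multiplication sign).
inline unsigned char latin1Lower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return static_cast<unsigned char>(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return static_cast<unsigned char>(c + 0x20);
    return c;
}

// Apply the ASCII-only rule to every byte of the string.
inline std::string toLowerAscii(std::string s)
{
    for (char &ch : s)
        ch = static_cast<char>(asciiLower(static_cast<unsigned char>(ch)));
    return s;
}

// Apply the Latin-1 rule to every byte of the string.
inline std::string toLowerLatin1(std::string s)
{
    for (char &ch : s)
        ch = static_cast<char>(latin1Lower(static_cast<unsigned char>(ch)));
    return s;
}
```

With UTF-8 input such as "\xC3\x89" (U+00C9), the ASCII-only version leaves
both bytes alone, while the Latin-1 version rewrites the lead byte 0xC3 to
0xE3 and so corrupts the sequence.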

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products