[Development] char8_t summary?

Thu Jul 11 11:05:02 CEST 2019

On 2019-07-11 10:13, André Pönitz wrote:
> On Wed, Jul 10, 2019 at 10:01:04PM -0300, Thiago Macieira wrote:
>> On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
>> > As far as I understand there's a perceived need to have "full" utf8
>> > literals, and there's a need to have ASCII literals. First could be
>> > served by some QUtf8*, second by QAscii*, both additions, no need to
>> > change QLatin* semantics.
>> 
>> ASCII = Latin1
> 
> bool = char ?
> 
> circle = ellipse ?
> 
> It's a subset, it is special enough to be called by its name. Especially
> if it has features (e.g. toUpper/toLower operating on single letters) that
> are not present in the larger set.
> 
> The line of discussion here is
> 
>   - people (correctly, happily) use toUpper on (7-bit clean US-ASCII) data
>   - ASCII is claimed to be identical to Latin1
>   - since it is identical it is superfluous to have both and ASCII is dropped
>   - toUpper does not work per-char for Latin1 in corner cases
>   - so it needs to be dropped "to avoid wrong use"
> 

There is a cost associated with another string class, too: a
combinatorial explosion of overloads. Even when we have all view types 
(QLatin1StringView, QUtf8StringView, QStringView), consider the overload 
set of QString::replace(), ignoring the (ptr, size) variants:

    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}

that's 16 overloads. And that's without a possible QUtf32StringView. 
Ditto for the relational operators. Add QAsciiStringView and you're up 
to 25. Mind you, this is the math for the end game: no more const char*, 
const char8_t*, and (ptr, size) overloads as they've all been subsumed 
by their corresponding views. We'll be there, maybe, come Qt 7. The math 
is even worse until then.
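
To make the arithmetic above concrete, here is a minimal sketch (plain C++, not Qt code). overloadCount() is a hypothetical helper, and it assumes every argument kind can appear independently on both sides of a two-argument function such as replace():

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only: number of two-argument
// overloads needed when each argument can independently be any of
// `kinds` argument types.
constexpr std::size_t overloadCount(std::size_t kinds)
{
    return kinds * kinds;
}

// {QL1V, QU8V, QSV, QChar} on both sides.
static_assert(overloadCount(4) == 16, "four argument kinds");
// Adding QAsciiStringView as a fifth kind.
static_assert(overloadCount(5) == 25, "five argument kinds");
// Adding a possible QUtf32StringView on top of that.
static_assert(overloadCount(6) == 36, "six argument kinds");
```

The growth is quadratic per function, and it repeats for every function or operator that takes two string-like arguments.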

> In the end this deprives users from a useful tool in a scenario where it
> was perfectly fine to use.

I don't see how. Users will be able to use QU8V or QL1V's toUpper() and 
they'll just work for US-ASCII. The L1 algorithm can be coded such that 
only ß and \xFF are on a slow path. Or maybe it's the case that 
toUpper() doesn't extend the length of UTF-8-encoded text? Maybe we're 
lucky and Unicode finally gets that the capital letter ß isn't SS, but 
ẞ, and we can then just document that if the capital letter isn't 
representable in L1, then it stays unchanged.
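
A minimal sketch of that "stays unchanged" policy, assuming Latin-1 bytes held in a std::string rather than any Qt type. The function names are hypothetical and this is not how Qt implements it. Besides ß (0xDF) and ÿ (0xFF), the sketch also leaves µ (0xB5) alone, because its Unicode uppercase is the Greek capital Mu, which Latin-1 cannot represent either:

```cpp
#include <string>

// Hypothetical per-character upper-casing for Latin-1 (ISO 8859-1).
// Letters whose capital is not representable in Latin-1 (0xDF, 0xFF,
// 0xB5) and all non-letters, including the division sign 0xF7, are
// returned unchanged.
constexpr unsigned char latin1ToUpperChar(unsigned char c)
{
    return ((c >= 0x61 && c <= 0x7A)                  // US-ASCII a-z
            || (c >= 0xE0 && c <= 0xFE && c != 0xF7)) // à-þ, minus ÷
        ? static_cast<unsigned char>(c - 0x20)
        : c;
}

// Hypothetical string-level wrapper: the result always has the same
// length as the input, so the conversion could also be done in place.
inline std::string latin1ToUpperString(std::string s)
{
    for (char &ch : s)
        ch = static_cast<char>(latin1ToUpperChar(static_cast<unsigned char>(ch)));
    return s;
}
```

For pure US-ASCII input only the first range check ever matches, so such data behaves exactly as it would with an ASCII-only toUpper().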

I'm still not convinced that QAsciiString is needed for any of this.

Thanks,
Marc