[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

Fri May 15 03:12:15 CEST 2020

On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> Also, given a function like
> 
>     setFoo(const QByteArray &);
> 
> what does this actually expect? An UTF-8 string? A local 8-bit string?
> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
> master of none.

Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
There's still a reason to have QByteArray and it'll need to exist in 
networking and file I/O code. That means the string classes, if any, need to 
be convertible to QByteArray anyway.

> So, assuming the premiss that QByteArray should not be string-ish
> anymore, what do we want to have as the result type of QString::toUtf8()
> and QString::toLatin1()? Do we really want mere bytes?
> 
> I don't think so.

Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
bytes". QByteArray does serve that purpose.

> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
> (as they are on Windows). It makes a _ton_ of sense to have a container
> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
> to do string processing in UTF-8 without potentially doubling the
> storage requirements by first converting it to UTF-16, then doing the
> processing, then converting it back.

Unless you're processing Cyrillic or Greek text, in which case your memory 
usage will be about the same. Or if you're processing CJK, in which case 
UTF-16 is a 33% reduction in memory use.
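
To put numbers on that, here is a minimal sketch in plain C++ (no Qt) that 
counts encoded sizes per code point. The names utf8Bytes, utf16Bytes and 
encodedSizes are made up for illustration and are not Qt API; the input is 
assumed to hold valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
constexpr std::size_t utf8Bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16 (two bytes per code unit,
// one surrogate pair for anything outside the BMP).
constexpr std::size_t utf16Bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

struct Sizes
{
    std::size_t utf8;
    std::size_t utf16;
};

// Total storage for the same text in both encodings (hypothetical helper).
inline Sizes encodedSizes(const std::u32string &text)
{
    Sizes s{0, 0};
    for (char32_t cp : text) {
        s.utf8 += utf8Bytes(cp);
        s.utf16 += utf16Bytes(cp);
    }
    return s;
}
```

ASCII text doubles in UTF-16, Cyrillic and Greek come out even at two bytes 
per character, and BMP CJK text takes three bytes in UTF-8 against two in 
UTF-16.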

> Qt should have a strong story not just for UTF-16, but also for UTF-8.

So long as it's not confusing on which class to use, sure. If that means a 
proliferation of overloads everywhere, we've gone wrong somewhere.

> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,
> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
> team has them down to within 5% of each other, not sure that's
> possible). 

The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
difference in performance is the need to check if the high bit is set. Both 
codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
Neon implementations, but I don't know their benchmark numbers (note: the 
UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
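
A scalar sketch of where that difference comes from, not the actual Qt code 
(which is vectorised): the Latin1 loop is an unconditional zero-extension, 
while the UTF-8 fast path has to test the high bit of every byte and bail 
out to a full decoder when it is set. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Latin-1 to UTF-16: every byte maps to the code point of the same value,
// so each byte is simply widened to 16 bits.
inline void latin1ToUtf16(const unsigned char *src, std::size_t n, char16_t *dst)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

// UTF-8 to UTF-16, US-ASCII fast path only. Stops at the first byte with
// the high bit set and returns how many bytes were converted; a real codec
// would continue from there with a multi-byte decoder.
inline std::size_t utf8AsciiPrefixToUtf16(const unsigned char *src, std::size_t n,
                                          char16_t *dst)
{
    std::size_t i = 0;
    for (; i < n; ++i) {
        if (src[i] & 0x80)      // the extra check the Latin-1 loop doesn't need
            break;
        dst[i] = src[i];
    }
    return i;
}
```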

For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending 
on how dense the non-ASCII characters are in the string. But given that we 
want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 
should be rare.

I also have an implementation of UTF-16 to ASCII codec, which is the same as 
UTF-16 to Latin1, but without error checking. That requires that the string 
class store whether it contains only US-ASCII. I've never pushed this to Qt.

> Anyway, we'd have two class templates, and they'd just be
> instantiated with different Char types to flesh out all of the above,
> with the exception of the byte array ones:
> 
>    using QUtf8String = QBasicString<char8_t>;
>    using QString = QBasicString<char16_t>;
>    using QLatin1String = QBasicString<char>;
>    (using QByteArray = QVector<std::byte>;)

BTW, I've said this before: QVector should over-allocate by one element and 
memset it to zero, if the element is small enough (4 or 8 bytes). This should 
be done behind the scenes, so the API would never notice it. But it would 
allow transferring the ownership of a QByteArray's payload to any of the other 
classes and still have a null-terminated string.
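
Roughly what I mean, as a standalone sketch rather than a patch against 
QVector: PaddedBuffer is a hypothetical class, and it value-initialises the 
extra element instead of calling memset, which has the same effect for these 
small trivial types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Hypothetical container: stores n elements but allocates n + 1 and keeps
// the extra one zeroed, so data() is always null-terminated.
template <typename T>
class PaddedBuffer
{
    static_assert(std::is_trivially_copyable<T>::value && sizeof(T) <= 8,
                  "only small, trivial element types get the hidden terminator");

public:
    PaddedBuffer(const T *data, std::size_t n)
        : m_size(n), m_data(new T[n + 1])
    {
        std::copy_n(data, n, m_data.get());
        m_data[n] = T();        // hidden terminator, not counted in size()
    }

    std::size_t size() const { return m_size; }
    const T *data() const { return m_data.get(); }

private:
    std::size_t m_size;
    std::unique_ptr<T[]> m_data;
};
```

The API only ever reports size() elements, so callers can't tell the extra 
element is there, but the payload can be handed to a string class as-is.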

I don't mind having a QUtf8String{,View} but there needs to be a limit on 
how much we add to its API. Do we have indexOf(char32_t) optimised with 
vectorisation? Do we have indexOf(QRegularExpression)? The latter would make 
us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly 
conversions and memory allocations. If your objective is to speed things up, 
having too many methods may actually make it worse.

And then there's the overload set for generic functions. I'm going to insist on a 
single, clear rule that does not depend on implementation details and is 
reasonably future-proof. It has to be about *what* the function does, not 
*how* it does that.

> If, after getting all of the above running, we _then_ want The One String
> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
> need a QAnyString), which can contain any of the 2-4 string (view)
> classes above (but not QByteArray(View)), but which doesn't have
> string-ish API. Instead, you need to inspect it to extract the actual
> string class (QLatin1String, QUtf8String, QString) contained, or simply
> ask for the one you want, and it will convert, if necessary.

Excluding QLatin1String since I don't think we need that, I'm willing to see 
this effort through. We need proofs of concept to show it works. And that's 
after we decide what QUtf8String is in the first place -- and that practically 
requires C++20.

There isn't enough time for that before 6.0. Therefore, we need a solution for 
the APIs that doesn't include QAnyString. So I can't take this suggestion:

> With this, your typical Qt function taking strings would look like this:
> 
>     QLineEdit::setText(QAnyStringView text)

>     Meep parseMeep(QAnyStringView str)
>     {
>         return str.visit([](auto str) {
>             Meep meep;
>             for (auto me : str.tokenize(u'\n'))
>                meep += parse(me);
>             return meep;
>         });
>     }
> 
> iow: instead of a bunch of overloads, you write your code as a template
> and let QAnyStringView instantiate your lambda with the actual type of
> string view passed.

At the cost of code size increase. More likely, our content will instead 
convert to UTF-16 and operate on that. That's trading code size for runtime 
memory consumption (sometimes).
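
To illustrate the code size point with standard C++17 only: in the sketch 
below, AnyStringView is a hypothetical stand-in built on std::variant, not 
the proposed QAnyStringView, and countLines is an invented example function. 
The generic lambda is instantiated once per alternative, so every function 
written this way is compiled once per string type.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical stand-in for an "any string view": one of two encodings.
using AnyStringView = std::variant<std::string_view, std::u16string_view>;

// Counts newline-separated lines. std::visit instantiates the lambda body
// separately for std::string_view and std::u16string_view.
inline std::size_t countLines(AnyStringView text)
{
    return std::visit([](auto view) -> std::size_t {
        using Char = typename decltype(view)::value_type;
        if (view.empty())
            return 0;
        std::size_t lines = 1;
        for (Char c : view) {
            if (c == Char('\n'))
                ++lines;
        }
        return lines;
    }, text);
}
```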

>     bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
>     {
>         return lhs.visit([rhs](auto lhs) {
>             return rhs.visit([lhs](auto rhs) {
>                 return lhs == rhs;
>             });
>         });
>     }

This MUST be non-inline and vectorised.

Latin1-to-UTF-16 comparisons are easy and we have them. UTF-8-to-UTF-16 not so 
much: as I explained above, our vector code only operates on US-ASCII. That 
might suffice for our needs.

Another problem of UTF-8 and UTF-16 comparisons is that the lengths can't be 
directly compared, but with Latin1 and UTF-16 they can. That means this part 
of the comparison between QString-QLatin1String can't hold:

bool QString::operator==(QLatin1String other) const noexcept
{
    if (size() != other.size())
        return false;

See compareElementRecursive() in qcborvalue.cpp for the comparison 
combinations:

    // Officially with CBOR, we sort first the string with the shortest
    // UTF-8 length. The length of an ASCII string is the same as its UTF-8
    // and UTF-16 ones, but the UTF-8 length of a string is bigger than the
    // UTF-16 equivalent. Combinations are:
    //  1) UTF-16 and UTF-16
    //  2) UTF-16 and UTF-8  <=== this is the problem case
    //  3) UTF-16 and US-ASCII
    //  4) UTF-8 and UTF-8
    //  5) UTF-8 and US-ASCII
    //  6) US-ASCII and US-ASCII
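
A scalar sketch of the difference, with hypothetical function names and 
assuming well-formed UTF-8 input: the Latin1 comparison can reject on size 
alone, while for UTF-8 you need a pass over the data just to learn how many 
UTF-16 code units it corresponds to.

```cpp
#include <cassert>
#include <cstddef>

// Latin-1 vs UTF-16: one byte is always one code unit, so a size mismatch
// settles the comparison before looking at any data.
inline bool equalsLatin1Utf16(const unsigned char *l1, std::size_t l1Size,
                              const char16_t *u16, std::size_t u16Size)
{
    if (l1Size != u16Size)
        return false;
    for (std::size_t i = 0; i < l1Size; ++i) {
        if (char16_t(l1[i]) != u16[i])
            return false;
    }
    return true;
}

// UTF-8 vs UTF-16: the byte count says nothing definite about the code unit
// count. This walks the (assumed well-formed) input to compute it.
inline std::size_t utf16LengthOfUtf8(const unsigned char *s, std::size_t n)
{
    std::size_t units = 0;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned char b = s[i];
        if ((b & 0xC0) == 0x80)
            continue;                   // continuation byte, already counted
        units += (b >= 0xF0) ? 2 : 1;   // 4-byte sequences need a surrogate pair
    }
    return units;
}
```

For example, U+00E9 is two bytes in UTF-8 and one code unit in UTF-16, so 
equal strings can have different sizes and the early return is not valid.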

There are a couple of vector implementations on the Internet (see 
branchless.org) that decode the full UTF-8 into UTF-32 in vector at rates 
approaching 3 bytes per cycle, which is about the rate of our UTF-8 decoder 
when run with US-ASCII content or the rate of the Latin1 decoder. If everyone 
cares to upgrade to Intel Ice Lake processors, we can do that. For everyone 
stuck with 2019 processors or older, there are slightly worse implementations.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products