[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

Lars Knoll lars.knoll at qt.io
Fri May 15 12:33:28 CEST 2020

> On 15 May 2020, at 03:12, Thiago Macieira <thiago.macieira at intel.com> wrote:
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> Also, given a function like
>>    setFoo(const QByteArray &);
>> what does this actually expect? A UTF-8 string? A local 8-bit string?
>> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
>> master of none.

What I would like to do right now for 6.0 is to assume that all 8-bit encoded text is UTF-8. Simple as that. If it’s something else, the developer will have to take care of it themselves. This is an important point for Qt 6.0 and independent of any QUtf8String we might or might not add later on.
> Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
> There's still a reason to have QByteArray and it'll need to exist in 
> networking and file I/O code. That means the string classes, if any, need to 
> be convertible to QByteArray anyway.

>> So, assuming the premiss that QByteArray should not be string-ish
>> anymore, what do we want to have as the result type of QString::toUtf8()
>> and QString::toLatin1()? Do we really want mere bytes?
>> I don't think so.
> Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
> bytes". QByteArray does serve that purpose.
>> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
>> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
>> (as they are on Windows). It makes a _ton_ of sense to have a container
>> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
>> to do string processing in UTF-8 without potentially doubling the
>> storage requirements by first converting it to UTF-16, then doing the
>> processing, then converting it back.

What are we actually gaining by having another string class? Yes, UTF-8 is being used in many places. But are the gains of directly working on UTF-8 enough to justify the duplication of all our string related APIs and implementations?
> Unless you're processing Cyrillic or Greek text, in which case your memory 
> usage will be about the same. Or if you're processing CJK, in which case 
> UTF-16 is a 33% reduction in memory use.

Correct. UTF-8 only saves space for content that is mostly ASCII. But if all you need is ASCII text processing, you can just as well do it on the current QByteArray.
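For illustration, the storage trade-off can be checked with plain string literals (standard C++, not Qt code; escapes are used so the snippet is independent of the source file's encoding):

```cpp
#include <cstddef>

// Byte sizes of the same text in UTF-8 vs UTF-16, excluding the terminator.

// ASCII "hello": UTF-8 is half the size of UTF-16.
constexpr std::size_t ascii_utf8  = sizeof(u8"hello") - sizeof(u8"");  // 5 bytes
constexpr std::size_t ascii_utf16 = sizeof(u"hello")  - sizeof(u"");   // 10 bytes

// Cyrillic "привет": 2 bytes per code point in both encodings, no saving.
constexpr std::size_t cyr_utf8  = sizeof(u8"\u043f\u0440\u0438\u0432\u0435\u0442") - sizeof(u8""); // 12 bytes
constexpr std::size_t cyr_utf16 = sizeof(u"\u043f\u0440\u0438\u0432\u0435\u0442")  - sizeof(u"");  // 12 bytes

// CJK "你好": 3 bytes per code point in UTF-8 vs 2 in UTF-16 -- the ~33% reduction.
constexpr std::size_t cjk_utf8  = sizeof(u8"\u4f60\u597d") - sizeof(u8""); // 6 bytes
constexpr std::size_t cjk_utf16 = sizeof(u"\u4f60\u597d")  - sizeof(u"");  // 4 bytes
```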
>> Qt should have a strong story not just for UTF-16, but also for UTF-8.
> So long as it's not confusing on which class to use, sure. If that means a 
> proliferation of overloads everywhere, we've gone wrong somewhere.


Almost all other programming languages out there have standardised on one class for Unicode string/text handling. IMO this is the correct approach. The fact that we’re using UTF-16 is historical, but it’s not better or worse than UTF-8. Let’s make transcoding fast, and stop worrying about several encodings.
>> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,

I’ll veto any UTF-32 string class. There is simply not a single good reason for using such a class. The only ‘advantage’ it has is one Unicode code point per index, but that doesn’t help, as Unicode text processing anyway needs to look beyond that (e.g. at grapheme clusters). And it wastes lots of memory.
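A quick standard-C++ illustration of why one-code-point-per-index doesn’t buy you character-level indexing: "é" written as 'e' plus U+0301 (combining acute accent) is a single grapheme cluster but two UTF-32 code points.

```cpp
#include <cstddef>
#include <string_view>

// Counts UTF-32 code points. Note this is NOT the number of user-perceived
// characters once combining marks are involved.
inline std::size_t codePointCount(std::u32string_view s)
{
    return s.size();
}
```

So even in UTF-32, `codePointCount(U"e\u0301")` is 2 for what renders as one character, while the precomposed form `U"\u00e9"` is 1 -- the index still doesn’t line up with graphemes.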

>> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
>> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
>> team has them down to within 5% of each other, not sure that's
>> possible). 
> The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
> within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
> difference in performance is the need to check if the high bit is set. Both 
> codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
> Neon implementations, but I don't know their benchmark numbers (note: the 
> UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
> For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending 
> on how dense the non-ASCII characters are in the string. But given that we 
> want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 
> should be rare.
> I also have an implementation of UTF-16 to ASCII codec, which is the same as 
> UTF-16 to Latin1, but without error checking. That requires that the string 
> class store whether it contains only US-ASCII. I've never pushed this to Qt.

Pretty much all uses of QL1String that I’ve seen are about ASCII-only content. That is certainly true for Qt itself, but also to a large degree for our users. For those, UTF-8 conversion is within 5% of Latin-1 decoding. This makes it very clear to me that we should *not* have any special handling for ASCII that requires a separate API.
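To make the "within 5%" point concrete, here is a minimal scalar sketch (standard C++, not Qt’s actual vectorised code) of the shared ASCII fast path: for bytes below 0x80 both Latin-1 and UTF-8 decoding just widen the byte to UTF-16, and the only extra work UTF-8 pays is the high-bit test.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Widens bytes to UTF-16 for as long as the high bit is clear, and returns
// how many bytes were consumed. A real UTF-8 decoder falls back to full
// multi-byte handling at the returned position; a Latin-1 decoder would
// simply keep widening.
inline std::size_t asciiFastPath(std::string_view in, std::u16string &out)
{
    std::size_t i = 0;
    while (i < in.size() && !(static_cast<std::uint8_t>(in[i]) & 0x80))
        out.push_back(static_cast<char16_t>(static_cast<std::uint8_t>(in[i++])));
    return i;
}
```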

Conversion speed for non-ASCII content is something we can improve; there are various BSD-licensed implementations out there that can do fast conversions for non-ASCII content as well.

>> Anyway, we'd have two class templates, and they'd just be
>> instantiated with different Char types to flesh out all of the above,
>> with the exception of the byte array ones:
>>   using QUtf8String = QBasicString<char8_t>;
>>   using QString = QBasicString<char16_t>;
>>   using QLatin1String = QBasicString<char>;
>>   (using QByteArray = QVector<std::byte>;)
> BTW, I've said this before: QVector should over-allocate by one element and 
> memset it to zero, if the element is small enough (4 or 8 bytes). This should 
> be done behind the scenes, so the API would never notice it. But it would 
> allow transferring the ownership of a QByteArray's payload to any of the other 
> classes and still have a null-terminated string.

Yes, we can and should do that for Qt 6. I want zero copy conversions between QString and QList/QVector<char16_t>.
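As a sketch of the over-allocation idea (a stand-in container, not Qt’s QVector): allocate one extra zero-initialised element past the visible size, so the payload can be handed to a string class and already be null-terminated.

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal illustration: n visible elements plus one hidden, zeroed element.
// The API never exposes the extra slot; it only exists so that transferring
// the buffer to a string class yields a null-terminated payload for free.
template <typename T>
struct ZeroPaddedBuffer {
    T *data;
    std::size_t size;

    explicit ZeroPaddedBuffer(std::size_t n)
        : data(static_cast<T *>(std::calloc(n + 1, sizeof(T)))), // n + 1, all zeroed
          size(n) {}
    ~ZeroPaddedBuffer() { std::free(data); }
    ZeroPaddedBuffer(const ZeroPaddedBuffer &) = delete;
    ZeroPaddedBuffer &operator=(const ZeroPaddedBuffer &) = delete;
};
```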
> I don't mind having a QUtf8String{,View} but there needs to be a limit into 
> how much we add to its API. Do we have indexOf(char32_t) optimised with 
> vectorisation? Do we have indexOf(QRegularExpression)? The latter would make 
> us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly 
> conversions and memory allocations. If your objective is to speed things up, 
> having too many methods may actually make it worse.

That’s why I’m rather sceptical about introducing a QUtf8String. It will lead to a large duplication of our API, similar to what we currently have with QL1String. Don’t forget, this does not only add to the size of our frameworks, but also to our maintenance burden.
> And then there's the overload set for generic functions. I'm going to insist a 
> single, clear rule that does not depend on implementation details and is 
> reasonably future-proof. It has to be about *what* the function does, not 
> *how* it does that.

>> If, after getting all of the above running, we _then_ want The One String
>> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
>> need a QAnyString), which can contain any of the 2-4 string (view)
>> classes above (but not QByteArray(View)), but which doesn't have
>> string-ish API. Instead, you need to inspect it to extract the actual
>> string class (QLatin1String, QUtf8String, QString) contained, or simply
>> ask for the one you want, and it will convert, if necessary.
> Excluding QLatin1String since I don't think we need that, I'm willing to see 
> this effort through. We need proofs of concept to show it works. And that's 
> after we decide what QUtf8String is in the first place -- and that practically 
> requires C++20.

I’d like to also see a proof of concept first. I am sceptical because I am afraid this will confuse users more than it will help. In addition, it will blow up our library code.
> There isn't enough time for that before 6.0. Therefore, we need a solution for 
> the APIs that doesn't include QAnyString. So I can't take this suggestion:
>> With this, your typical Qt function taking strings would look like this:
>>    QLineEdit::setText(QAnyStringView text)
[skipped implementation…]

I still wonder why this would be worthwhile doing. Would this really simplify things for our users? I somehow doubt it’s worth it.
>>    Meep parseMeep(QAnyStringView str)
>>    {
>>        return str.visit([](auto str) {
>>            Meep meep;
>>            for (auto me : str.tokenize(u'\n'))
>>               meep += parse(me);
>>            return meep;
>>        });
>>    }
>> iow: instead of a bunch of overloads, you write your code as a template
>> and let QAnyStringView instantiate your lambda with the actual type of
>> string view passed.
> At the cost of code size increase. More likely, our content will instead 
> convert to UTF-16 and operate on that. That's trading code size for runtime 
> memory consumption (sometimes).

Yes, as I said above I am absolutely not convinced that this would be worth the effort or the resulting additional complexity in our library.
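For reference, the visit pattern being discussed can be sketched with std::variant (hypothetical names, standard C++, not a proposed Qt API). It also shows exactly where the code-size cost comes from: the lambda body is instantiated once per contained view type.

```cpp
#include <cstddef>
#include <string_view>
#include <variant>

// Stand-in for QAnyStringView: holds one of several string view types.
using AnyStringView = std::variant<std::string_view, std::u16string_view>;

// One generic body, compiled once per alternative -- this duplication is
// the code-size increase Thiago mentions.
inline std::size_t countLines(AnyStringView str)
{
    return std::visit([](auto s) {
        using Char = typename decltype(s)::value_type;
        std::size_t n = 0;
        for (Char c : s)
            if (c == Char('\n'))
                ++n;
        return n;
    }, str);
}
```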


>>    bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
>>    {
>>        return lhs.visit([rhs](auto lhs) {
>>            return rhs.visit([lhs](auto rhs) {
>>                return lhs == rhs;
>>            });
>>        });
>>    }
> This MUST be non-inline and vectorised.
> Latin1-to-UTF-16 comparisons are easy and we have them. UTF-8-to-UTF-16 not so 
> much: as I explained above, our vector code only operates on US-ASCII. That 
> might suffice for our needs.
> Another problem of UTF-8 and UTF-16 comparisons is that the lengths can't be 
> directly compared, but with Latin1 and UTF-16 they can. That means this part 
> of the comparison between QString-QLatin1String can't hold:
> bool QString::operator==(QLatin1String other) const noexcept
> {
>    if (size() != other.size())
>        return false;
> See compareElementRecursive() in qcborvalue.cpp for the comparison 
> combinations.
>    // Officially with CBOR, we sort first the string with the shortest
>    // UTF-8 length. The length of an ASCII string is the same as its UTF-8
>    // and UTF-16 ones, but the UTF-8 length of a string is bigger than the
>    // UTF-16 equivalent. Combinations are:
>    //  1) UTF-16 and UTF-16
>    //  2) UTF-16 and UTF-8  <=== this is the problem case
>    //  3) UTF-16 and US-ASCII
>    //  4) UTF-8 and UTF-8
>    //  5) UTF-8 and US-ASCII
>    //  6) US-ASCII and US-ASCII
> There are a couple of vector implementations in the Internet (see 
> branchless.org) that decode the full UTF-8 into UTF-32 in vector at rates 
> approaching 3 bytes per cycle, which is about the rate of our UTF-8 decoder 
> when run with US-ASCII content or the rate of the Latin1 decoder. If everyone 
> cares to upgrade to Intel Ice Lake processors, we can do that. For everyone 
> stuck with 2019 processors or older, there are slightly worse implementations.
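(Editorial illustration of the length problem above, in standard C++: equal strings can have different lengths across UTF-8 and UTF-16, so the `size() != other.size()` early-out that works for Latin-1 cannot work for UTF-8.)

```cpp
#include <cstddef>

// "é" (U+00E9) is one UTF-16 code unit but two UTF-8 bytes, so equal
// strings need not have equal lengths across the two encodings.
constexpr std::size_t eAcute_utf16_units =
    sizeof(u"\u00e9") / sizeof(char16_t) - 1;  // 1 code unit
constexpr std::size_t eAcute_utf8_bytes =
    sizeof(u8"\u00e9") - sizeof(u8"");         // 2 bytes
```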
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel System Software Products
> _______________________________________________
> Development mailing list
> Development at qt-project.org
> https://lists.qt-project.org/listinfo/development
