[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

Thu May 14 16:41:45 CEST 2020

Hi Lars,

On 2020-05-12 09:49, Lars Knoll wrote:
[...]
> One open question is whether we should add a QUtf8String with a
> char8_t. I am not yet convinced that we actually need the class
> though.
[...]

I positively want to stop using QByteArray as the QUtf8String that it 
currently is. QByteArray should lose all notion of string-ness 
(deprecate toLower() etc, remove in Qt 7) and be a QVector<std::byte>. 
Not sure we'll get there for Qt 6, not sure we'll get there with the 
name QByteArray, but that should be the end game for this class.

The networking code is full of uses of QByteArray and, due to the lack of 
QByteArrayRef (QStringRef) or QByteArrayView (QStringView), its 
splitting and substringing is much less performant than it could be.
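
To make the cost concrete, here is a minimal sketch of view-based
splitting. It uses std::string_view as a stand-in for the proposed
QByteArrayView, and splitViews() is a hypothetical helper, not an
existing Qt function:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `input` at every `sep` without copying bytes.
// std::string_view stands in for the proposed QByteArrayView.
std::vector<std::string_view> splitViews(std::string_view input, char sep)
{
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = input.find(sep, start);
        if (end == std::string_view::npos) {
            parts.push_back(input.substr(start)); // last (possibly empty) part
            return parts;
        }
        parts.push_back(input.substr(start, end - start));
        start = end + 1; // continue after the separator
    }
}
```

Each part only points into the original buffer, so the only allocation
is the vector itself. Splitting into owning byte arrays would allocate
and copy once per part on top of that.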

Also, given a function like

    setFoo(const QByteArray &);

what does this actually expect? A UTF-8 string? A local 8-bit string? 
An octet stream? A Latin-1 string? QByteArray is the jack of all these, 
master of none.

So, assuming the premise that QByteArray should not be string-ish 
anymore, what do we want to have as the result type of QString::toUtf8() 
and QString::toLatin1()? Do we really want mere bytes?

I don't think so.

If Unicode succeeds, most I/O will be in the form of UTF-8. File names 
on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 
(as they are on Windows). It makes a _ton_ of sense to have a container 
for this, and C++20 tempts us with char8_t to do exactly that. I'd love 
to do string processing in UTF-8 without potentially doubling the 
storage requirements by first converting it to UTF-16, then doing the 
processing, then converting it back.
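
A rough illustration of the storage argument, using the standard string
classes as stand-ins (payloadBytes() is a made-up helper for this
sketch, and it ignores allocator and small-string overhead):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes of character data held by a string,
// i.e. number of code units times the size of one code unit.
template <typename Char>
std::size_t payloadBytes(const std::basic_string<Char> &s)
{
    return s.size() * sizeof(Char);
}
```

For ASCII-only text such as "Content-Type" this gives 12 bytes as UTF-8
and 12 * sizeof(char16_t) as UTF-16, i.e. twice as much on the usual
platforms. For U+00E9 (two UTF-8 code units, one UTF-16 code unit) both
come out the same there, hence "potentially" doubling.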

Qt should have a strong story not just for UTF-16, but also for UTF-8.

I've talked about this at QtWS, but here's the TL;DV of it:

value_type             container       view                string-ish API?

char / QLatin1Char     QLatin1String   QLatin1StringView   yes
char8_t / qchar8       QUtf8String     QUtf8StringView     yes
char16_t / QChar       QString         QStringView         yes
(char32_t              QUtf32String    QUtf32StringView    yes)

std::byte              QByteArray      QByteArrayView      NO

I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one, 
provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16 
operations are not much slower than L1 <-> utf16 ones (I heard Lars' 
team has them down to within 5% of each other, not sure that's 
possible). Anyway, we'd have two class templates, and they'd just be 
instantiated with different Char types to flesh out all of the above, 
with the exception of the byte array ones:

   using QUtf8String = QBasicString<char8_t>;
   using QString = QBasicString<char16_t>;
   using QLatin1String = QBasicString<char>;
   (using QByteArray = QVector<std::byte>;)
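
As a sketch of what "one class template, several instantiations" means,
here is a toy version built on std::basic_string. BasicStringSketch and
its aliases are hypothetical names, not the proposed QBasicString, and
the char8_t instantiation is left out only so that the snippet also
compiles as C++17:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: the string-ish API is written once, for any Char.
template <typename Char>
class BasicStringSketch {
public:
    BasicStringSketch(const Char *str) : m_data(str) {}

    std::size_t size() const { return m_data.size(); }

    // Compares code units; true if this string begins with `prefix`.
    bool startsWith(const BasicStringSketch &prefix) const
    {
        return m_data.compare(0, prefix.size(), prefix.m_data) == 0;
    }

private:
    std::basic_string<Char> m_data;
};

// One alias per encoding, all sharing the implementation above.
using Latin1StringSketch = BasicStringSketch<char>;
using Utf16StringSketch = BasicStringSketch<char16_t>;
using Utf32StringSketch = BasicStringSketch<char32_t>;
```

The standard library does the same thing: std::string, std::u16string
and std::u32string are all instantiations of std::basic_string.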

If, after getting all of the above running, we _then_ want The One String 
(View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we 
need a QAnyString), which can contain any of the 2-4 string (view) 
classes above (but not QByteArray(View)), but which doesn't have 
string-ish API. Instead, you need to inspect it to extract the actual 
string class (QLatin1String, QUtf8String, QString) contained, or simply 
ask for the one you want, and it will convert, if necessary.

With this, your typical Qt function taking strings would look like this:

    void QLineEdit::setText(QAnyStringView text)
    {
        Q_D(QLineEdit);
        if (text == d->text) // mixed-mode comparisons are supported out of the box
            return;
        // centralized conversion to QString (in library, not user code);
        // also available: toLatin1(), toUtf8()
        d->text = text.toString();
        update();
    }

Callers now have total freedom in what to pass:

    le->setText("Hi");
    le->setText(u"Hi");
    le->setText(u8"Hi");
    le->setText(u"Hi"s);
    le->setText(u8"Hi"sv);
    le->setText(QVarLengthArray{'H', 'i'});
    le->setText("Hello" % ", World"); // QStringBuilder

and they'd all result in optimal code, because QAnyStringView is a 
trivial type (in the C++ sense), which means, unlike QString, it can be 
passed in CPU registers instead of on the stack.
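
The property in question can be checked at compile time. The following
is only a sketch: ViewSketch is a hypothetical pointer-plus-size
struct, not the real layout of any Qt view, and std::string stands in
for an owning string class. Whether such a struct really travels in
registers depends on the platform ABI:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical view layout: a pointer and a size, nothing else.
struct ViewSketch {
    const void *data;
    std::size_t size;
};

// Copying or destroying a view runs no user code ...
static_assert(std::is_trivially_copyable_v<ViewSketch>);
static_assert(std::is_trivially_destructible_v<ViewSketch>);
// ... whereas an owning string has to manage its buffer.
static_assert(!std::is_trivially_copyable_v<std::string>);

// Taking the view by value involves no copy constructor or destructor call.
std::size_t viewSize(ViewSketch v)
{
    return v.size;
}
```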

Likewise, parsing code could do

    Meep parseMeep(QAnyStringView str)
    {
        return str.visit([](auto str) {
            Meep meep;
            for (auto me : str.tokenize(u'\n'))
                meep += parse(me);
            return meep;
        });
    }

iow: instead of a bunch of overloads, you write your code as a template 
and let QAnyStringView instantiate your lambda with the actual type of 
string view passed.
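
For those who want to play with the idea today, the visit() mechanism
can be approximated with std::variant. This is a sketch under
assumptions: AnyStringViewSketch and countNewlines() are invented for
the example, and the real QAnyStringView would not need to be built on
std::variant:

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for QAnyStringView: holds exactly one of the
// standard string views and hands it to a visitor with its real type.
class AnyStringViewSketch {
public:
    AnyStringViewSketch(std::string_view s) : m_view(s) {}
    AnyStringViewSketch(std::u16string_view s) : m_view(s) {}
    AnyStringViewSketch(std::u32string_view s) : m_view(s) {}

    template <typename Visitor>
    decltype(auto) visit(Visitor &&visitor) const
    {
        return std::visit(std::forward<Visitor>(visitor), m_view);
    }

private:
    std::variant<std::string_view, std::u16string_view,
                 std::u32string_view> m_view;
};

// Written once; the generic lambda is instantiated per view type.
std::size_t countNewlines(AnyStringViewSketch str)
{
    return str.visit([](auto view) -> std::size_t {
        using Char = typename decltype(view)::value_type;
        std::size_t count = 0;
        for (Char c : view) {
            if (c == Char('\n'))
                ++count;
        }
        return count;
    });
}
```

A caller can pass std::string_view("a\nb\nc") or
std::u16string_view(u"a\nb") to the same function. No overloads are
needed, only the one generic lambda.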

As a further example, here's op== for QAnyStringView (provided by Qt):

    bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
    {
        return lhs.visit([rhs](auto lhs) {
            return rhs.visit([lhs](auto rhs) {
                return lhs == rhs;
            });
        });
    }

Last year, I heard someone (don't remember whom) suggest this for 
QString. That is: allow QString to hold UTF-16 or UTF-8 data. I'd 
classify this idea as another over-my-dead-body (which, btw, is 
semi-official ISO speak for "strong objection"). As I'm wont to say: An 
API doesn't become easy to use by minimizing the number of classes, but 
by minimizing the number of responsibilities per class, even if that 
means many more small classes instead of one big one.

I would add, as I've done before, and even Matthew said, that I'd be 
very wary of folding QStringView into QString. I can understand the urge 
to not have to go and s/QString/QStringView/ in many places (or 
s/QString/QAnyStringView/), but it is my firm belief that it would make 
Qt much easier and more convenient to use if we didn't put all those 
responsibilities on QString.

There's only our own laziness which stands in the way of this better 
alternative.

Thanks,
Marc