[Development] QAnyStringView

Tue Jun 23 11:35:05 CEST 2020

Hi,

I went to the drawing board and drew up a variant string view class. 
It's here: https://codereview.qt-project.org/c/qt/qtbase/+/301594

Here's why I think we need it. At the end of the email, I also suggest 
how we should go about introducing it into Qt.

Thiago and Lars are meanwhile convinced that we need a QUtf8StringView, 
too. Lars sees some merit for low-level APIs, Thiago remains 
unconvinced.

I have come to believe that QUtf8StringView without QAnyStringView won't 
fly: Introducing QUtf8StringView without QAnyStringView will explode the 
number of mixed-type operations we need to support. If we don't remove 
anything, we're talking about

- QString
- QStringRef*
- QStringView
- QByteArray
- QByteArrayView
- QUtf8StringView
- QLatin1String
- char16_t
- QChar
- char8_t
- char
- QLatin1Char
- const char*
- const char16_t*
- const char8_t*

and anything I've forgotten. The best we can do to condense this down is 
to revoke string-ness of QByteArray and we'd be left with

- QStringView
- QLatin1String
- QUtf8StringView
- QChar

the last of which would have to accept plain char again, something we 
ASCII_DEPRECATED years ago, but which should be reconsidered under the 
new src-is-UTF-8 paradigm.

Lars would probably say that we could also drop QLatin1String, with 
which I disagree[1].

Assuming for the sake of argument that we need those four types, 
consider QString::replace(). Experience shows that stuff like 
QStringBuilder expressions being passed will require an actual QString 
overload to be present, too. Ignoring existing overloads and regexp, 
we'd need 5x5=25 overloads. I won't enumerate them here. What I will 
enumerate is the complete set of overloads when using QAnyStringView:

    QString& QString::replace(QAnyStringView, QAnyStringView,
                              Qt::CaseSensitivity);

That's it.
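
To make the overload arithmetic concrete, here is a minimal standalone sketch using only the standard library. Latin1View, AnyView and replaceAll are hypothetical stand-ins that I made up for illustration; they are not the API from the Gerrit change, and the sketch only covers Latin-1 and UTF-16 to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical tagged wrapper: bytes to be interpreted as Latin-1.
struct Latin1View { std::string_view bytes; };

// Hypothetical variant view: refers to Latin-1 or UTF-16 data, owns nothing.
class AnyView {
public:
    AnyView(Latin1View v) : m_data(v) {}
    AnyView(std::u16string_view v) : m_data(v) {}
    AnyView(const std::u16string &s) : m_data(std::u16string_view(s)) {}
    AnyView(const char16_t *s) : m_data(std::u16string_view(s)) {}

    // Centralised conversion: callers no longer build temporaries themselves.
    std::u16string toString() const
    {
        if (const Latin1View *l1 = std::get_if<Latin1View>(&m_data)) {
            std::u16string out;
            out.reserve(l1->bytes.size());
            for (unsigned char c : l1->bytes)
                out.push_back(char16_t(c)); // Latin-1 maps 1:1 onto U+0000..U+00FF
            return out;
        }
        return std::u16string(std::get<std::u16string_view>(m_data));
    }

private:
    std::variant<Latin1View, std::u16string_view> m_data;
};

// One signature instead of one overload per (before, after) type pair.
std::u16string &replaceAll(std::u16string &s, AnyView before, AnyView after)
{
    const std::u16string b = before.toString();
    const std::u16string a = after.toString();
    if (b.empty())
        return s; // nothing sensible to search for
    for (std::size_t pos = 0;
         (pos = s.find(b, pos)) != std::u16string::npos;
         pos += a.size())
        s.replace(pos, b.size(), a);
    return s;
}

// Usage: mixed argument types go through the same single function.
void replaceExample()
{
    std::u16string s = u"name=name";
    replaceAll(s, Latin1View{"name"}, u"id"); // Latin-1 needle, UTF-16 replacement
    assert(s == u"id=id");
}
```

The sketch takes the lazy route of converting both arguments to UTF-16 first; a real implementation would presumably dispatch on the stored encoding instead, but the point here is only the shape of the signature.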

Unlike QStringView, QAnyStringView is a pure interface type. I won't add 
much in the way of parsing API to it, even though I acknowledge that's a 
slippery slope. While it would be easy to add trimmed(), and tokenize() 
would be really interesting, QAnyStringView should not be used for 
parsing. That's what we have the three non-variant string view types 
for. Being a pure interface type means we can add more "dangerous" 
conversions. QStringView can't be constructed from a QStringBuilder, 
e.g., because it's almost impossible to make that work without 
referencing destroyed data:

    QStringView s1 = u'c' + QString::number(x); // oops
    QString c = u'c' + QString::number(x);
    QStringView s2 = c; // ok

But QAnyStringView supports this:

    str.replace(name, name % "_1");
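
The reason this is safe for a pure interface type comes down to the lifetime of temporaries. A minimal sketch with std::string_view standing in for the view type (countChars is a hypothetical function, not Qt API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical interface function: it only reads the view during the call
// and does not store it anywhere.
std::size_t countChars(std::string_view v)
{
    return v.size();
}

void lifetimeExample()
{
    const std::string name = "item";

    // Safe: the temporary std::string produced by operator+ lives until the
    // end of the full-expression, i.e. until after countChars() has returned.
    assert(countChars(name + "_1") == 6);

    // Not safe (left commented out on purpose): the temporary dies at the
    // semicolon, so the named view would dangle immediately.
    // std::string_view s = name + "_1";
}
```

As long as the view only ever appears as a function parameter, the first pattern is the only one users can write, which is what makes the conversion tolerable.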

In summary: 25 overloads is just way too much (and don't forget regex, 
which adds another five).

The replace() problem is also present with relational operators and 
basically wherever we have two QString arguments right now.

QAnyStringView solves this in the sense that one overload can replace 
many overloads. The complexity is still there: a binary visitation of 
QAnyStringView produces nine instantiations of the visitor (though that 
can be reduced to six in many cases), but many implementations fall into 
one of just two classes: 1) a function would just call toString() on the 
any-string-view, anyway, in which case the QString construction is taken 
out of user code and centralized in the library. If you think that 
doesn't matter, look at the tst_qstatemachine numbers in

   https://codereview.qt-project.org/c/qt/qtbase/+/301595 (-10KiB just 
from temporary QString creation and destruction)

2) the complexity is already there and QAnyStringView helps in reducing 
it:

   https://codereview.qt-project.org/c/qt/qtbase/+/303483 (QCalendar)
   https://codereview.qt-project.org/c/qt/qtbase/+/303512 (QColor)
   https://codereview.qt-project.org/c/qt/qtbase/+/303707 (arg())
   https://codereview.qt-project.org/c/qt/qtbase/+/303708 (QUuid)
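
For the nine-versus-six count mentioned above, here is one way the reduction could look, sketched with std::variant and std::visit. L1, U8, U16, Kernel and dispatch are hypothetical names for this illustration only; the actual change may do this quite differently. The visitor lambda is still instantiated for all nine ordered pairs, but for a symmetric operation it forwards to just six hand-written kernels by swapping the operands where needed:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>
#include <type_traits>
#include <variant>

// Hypothetical tagged views, one per encoding.
struct L1  { std::string_view d; };
struct U8  { std::string_view d; };
struct U16 { std::u16string_view d; };
using AnyView = std::variant<L1, U8, U16>;

// The six hand-written kernels: the upper triangle of the 3x3 matrix.
// Each one just reports its name so the dispatch can be observed.
struct Kernel {
    std::string operator()(L1,  L1)  const { return "L1/L1"; }
    std::string operator()(L1,  U8)  const { return "L1/U8"; }
    std::string operator()(L1,  U16) const { return "L1/U16"; }
    std::string operator()(U8,  U8)  const { return "U8/U8"; }
    std::string operator()(U8,  U16) const { return "U8/U16"; }
    std::string operator()(U16, U16) const { return "U16/U16"; }
};

// Binary visitation for a symmetric operation (think equality).
std::string dispatch(const AnyView &a, const AnyView &b)
{
    return std::visit([](auto x, auto y) -> std::string {
        if constexpr (std::is_invocable_v<Kernel, decltype(x), decltype(y)>)
            return Kernel{}(x, y);
        else
            return Kernel{}(y, x); // no kernel for this order: swap operands
    }, a, b);
}

// Runs all nine ordered pairs and counts how many distinct kernels were hit.
std::size_t distinctKernels()
{
    const AnyView views[] = { L1{"a"}, U8{"a"}, U16{u"a"} };
    std::set<std::string> seen;
    for (const AnyView &a : views)
        for (const AnyView &b : views)
            seen.insert(dispatch(a, b));
    return seen.size();
}

void visitationExample()
{
    assert(distinctKernels() == 6);
}
```

Non-symmetric operations (replace() with distinct needle and replacement, say) cannot use the swap, which is why the reduction only applies in many cases and not in all.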

Another aspect that I'd like to mention is how QAnyStringView also helps 
with getting rid of QLatin1String for Qt 7: Instead of having QL1S 
strewn around the Qt API as we have now, we'd have just the 
QAnyStringView(QLatin1String) ctor that we'd need to deprecate.

Finally, of course, QAnyStringView increases integration of Qt with 
other C++ libraries, because it now transparently accepts almost any 
string type that exists out there (thanks to Peppe's Magic QStringView 
ctor that QUtf8StringView and QAnyStringView inherit).

I was very sceptical when some months ago someone on this ML suggested 
to make QString hold either UTF8 or UTF16 data, and I still am, but in 
an explicit variant string view type, this concept suddenly makes a lot 
of sense.

Now that I hopefully have convinced you that we need QAnyStringView, 
where to go from here?

Given the lack of time until Qt 6.0, I'd like to propose to just replace 
all overload sets that contain QL1S with one overload taking 
QAnyStringView.

The implementation usually contains the optimized handling of L1 data 
already, and can often be easily extended to UTF-8, too, cf. QColor, 
QUuid, arg().

This should really happen for Qt 6, because it will greatly clean up our 
lower-level APIs and tell a consistent story.

On top of that, we can also think of replacing overload sets that 
contain QString and (QStringView or QStringRef) with one overload taking 
QAnyStringView, or QString functions that typically get passed constants 
(like setObjectName()), but I agree with Lars that there's not enough 
time and man-power to bring this to a conclusion for Qt 6.

Thanks,
Marc

[1] First, we have a lot of existing QLatin1String use in code, both in 
Qt itself, as well as in code that has seen e.g. Clazy. Users of 
QLatin1String know why they use this class - it's either to silence 
QT_NO_CAST_FROM_ASCII or because there's a QLatin1String overload that 
they call and that prevents a QString creation. Either way, these 
developers will not react kindly if a recommended-in-Qt-5 solution 
suddenly gets either removed or heavily pessimized in Qt 6.

Second, UTF-8 is a variable-length encoding, like UTF-16. Unlike L1 -> 
UTF-16, however, the number of code units needed to represent L1 in U8 
is not constant. That means that important optimisations like

    bool operator==(QLatin1String lhs, QStringView rhs)
    { return lhs.size() == rhs.size() && ~~~~; }

no longer work:

    bool operator==(QUtf8StringView lhs, QStringView rhs)
    { return lhs.size() == rhs.size() // NOPE!
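
To spell out why the size check is valid for L1 but not for U8, here is a standalone sketch with standard-library views. equalsLatin1 is a hypothetical helper; I assume char-based views hold Latin-1 or UTF-8 bytes as stated in the comments:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Latin-1 vs UTF-16: every Latin-1 byte becomes exactly one UTF-16 code
// unit, so differing sizes already prove inequality.
bool equalsLatin1(std::string_view l1, std::u16string_view u16)
{
    if (l1.size() != u16.size())
        return false; // O(1) early-out, no character is inspected
    for (std::size_t i = 0; i < l1.size(); ++i)
        if (char16_t(static_cast<unsigned char>(l1[i])) != u16[i])
            return false;
    return true;
}

void sizeCheckExample()
{
    // U+00E9 (e with acute accent) in three encodings.
    const std::string_view    l1  = "\xE9";     // Latin-1: 1 code unit
    const std::string_view    u8  = "\xC3\xA9"; // UTF-8:   2 code units
    const std::u16string_view u16 = u"\u00E9";  // UTF-16:  1 code unit

    assert(l1.size() == u16.size());  // sizes agree: the check is meaningful
    assert(u8.size() != u16.size());  // same text, different sizes: it is not
    assert(equalsLatin1(l1, u16));
    assert(!equalsLatin1("widget", u"type")); // rejected by size alone
}
```

For UTF-8 the two sizes say nothing about equality, so a comparison has to start decoding right away.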

If you think this doesn't matter, think again: it's the reason why in 
C++20 the original design of <=> was changed to only synthesize <, >, 
<=, >= and no longer also ==, !=. If you still don't believe, look at 
some if-else-chain that probably already exists somewhere (uic comes to 
mind):

     if (name == QLatin1String("widget")) {
         ~~~~
     } else if (name == QLatin1String("type")) {
         ~~~~ 50 other tokens ~~~~
     } else {
         // error
     }

all of these start with a size check whereas

     if (name == "widget") {
         ~~~~
     } else if (name == "type") {
         ~~~~ 50 other tokens ~~~~
     } else {
         // error
     }

cannot. They immediately go into the strcmp loop. Now imagine there's a 
rather common prefix to all these tags...

     if (name == "qt_impl_widget") {
         ~~~~
     } else if (name == "qt_impl_type") {
         ~~~~ 50 other tokens ~~~~
     } else {
         // error
     }

and you see where I'm going.
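
For those who prefer numbers, here is a deliberately simplified model of the chain above. It uses plain char data on both sides and just counts character comparisons; charCompares and charComparesWithSizeCheck are hypothetical helpers, and a real UTF-8 vs UTF-16 comparison would additionally have to decode:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of character comparisons a strcmp-style loop performs before it
// hits the first mismatch or runs out of data.
std::size_t charCompares(std::string_view a, std::string_view b)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        ++n;
        if (a[i] != b[i])
            break;
    }
    return n;
}

// Same loop, but with the O(1) size check in front of it.
std::size_t charComparesWithSizeCheck(std::string_view a, std::string_view b)
{
    return a.size() != b.size() ? 0 : charCompares(a, b);
}

void prefixExample()
{
    const std::string_view name = "qt_impl_name"; // 12 characters

    // Without the size check, the 8-character common prefix is walked
    // before the 9th character finally differs.
    assert(charCompares(name, "qt_impl_widget") == 9);

    // With the size check, 12 != 14 rejects the token immediately.
    assert(charComparesWithSizeCheck(name, "qt_impl_widget") == 0);
}
```

Multiply the difference by the fifty-odd tokens in such a chain and the cost of losing the size check becomes visible.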