[Development] QString and related changes for Qt 6

Tue May 12 11:01:45 CEST 2020

Lars Knoll (12 May 2020 09:49) wrote:
> My high level goal for the string classes in Qt 6 was to complete the
> Unicode related changes that we started for Qt 5.0, where we made utf8
> and utf16 the main encodings, and simplify things. I believe it’s
> important to leave the non Unicode world behind us, and offer an as
> consistent cross-platform story here as we can.

+1

> A next step is to change the build system, so that it (by default)
> assumes that source code is encoded in UTF-8. We [already] do set
> compiler flags to ensure this when building Qt itself,

... except on Windows, which we plan to fix; good plan.

> Our string handling classes currently consist of the following
> classes: QByteArray, QString, QStringView, QStringRef, QStringLiteral
> and QLatin1String. The set is too large, inconsistent and needs
> cleaning up:

Indeed, we recently documented which should be used when; doing so made
it clear that we'd be better off with fewer of these, if only for the
sake of making it easier to explain!

> * QByteArray’s methods like toUpper() will only handle ASCII
>   characters (they assume Latin1 in Qt5).

We should document that doing even this is under sufferance and we wish
folk would stop using QByteArray for it.  It's an operation that
implicates the semantics of the bytes, so should be done using a class
that believes it knows the semantics of the bytes - which QByteArray
should steadfastly refuse to do.  Aim to remove at Qt 7.
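
For concreteness, here is a minimal sketch of what "ASCII only" means for
an upper-casing operation on raw bytes. It uses plain standard C++ rather
than QByteArray, and asciiToUpper() is a hypothetical helper, not a Qt
function: only the bytes 'a'..'z' are changed, and every byte outside
ASCII (for example the bytes of a UTF-8 multi-byte sequence) is passed
through untouched.

```cpp
#include <string>

// Hypothetical helper, not a Qt API: upper-cases only the ASCII letters
// 'a'..'z' and leaves every other byte exactly as it was.
std::string asciiToUpper(std::string bytes)
{
    for (char &c : bytes) {
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - ('a' - 'A'));
    }
    return bytes;
}
```

Given the UTF-8 bytes for "grüß", only the "gr" is upper-cased; the
non-ASCII bytes survive unchanged, which is exactly why a caller who cares
about the text's semantics wants a real string class instead.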

> This would leave us with 4 string-related classes: QByteArray(View)
> and QString(View).

Sounds much better; and clearer.

> One open question is whether we should add a QUtf8String with a
> char8_t. I am not yet convinced that we actually need the class
> though.

How about a QUtf8View, replacing QLatin1String, as the way to pass
single-byte-encoded literals into our string APIs?  See below.

> The next question is what we do with our API methods. Currently we
> have many places where we have three to 4 overloads for the same
> methods (taking a QString, a QStringView, a QStringRef and a
> QLatin1String). We can’t have 4 overloads for each method in all of
> Qt, so we need to restrict overloads to the places where it is
> required. IMO this is mainly the string related classes
> themselves. And even there we can probably cut down on the number of
> overloads.

I largely agree, with the exception of: supporting an 8-bit string view
type for comparisons (including startsWith(), find()/indexOf() and
similar) can save client code a factor of two on the size of many string
literals.  I'm fine with limiting its use to the QString(View) API,
though.  So QUtf8View would replace QLatin1String as that 8-bit view
type, with a much more limited scope.

While we can simply ask folk to stick a u on the front of their strings,
doubling the size of each, it would be a kindness to those with lots of
string literals to allow them to use u8 instead and avoid that doubling.
Meanwhile, the many situations where data from an outside source arrives
in UTF-8 make a case for providing a view type that can wrap such data
and make it "presentable" for interaction with QString(View), tagged
with the right semantics (i.e. the knowledge that it's UTF-8) in the
type system.
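
To illustrate both points, here is a rough sketch in plain standard C++
(C++17 or later). The static_asserts show the literal-size difference for
ASCII content. Utf8View and startsWithAscii() are hypothetical names made
up for this mail, not a proposed API; the comparison deliberately handles
only ASCII prefixes and rejects anything else, where a real view type
would decode the UTF-8 properly.

```cpp
#include <cstddef>
#include <string_view>

// sizeof counts the terminating null, so each literal has 6 code units.
static_assert(sizeof(u"hello") == 6 * sizeof(char16_t));
static_assert(sizeof(u8"hello") == 6); // one byte per code unit

// Hypothetical tag type: records in the type system that the wrapped
// bytes are meant to be UTF-8. Not an existing Qt class.
struct Utf8View {
    std::string_view bytes;
};

// Sketch only: compares an ASCII-only prefix against UTF-16 text without
// converting or allocating. Any byte >= 0x80 makes it return false,
// because this sketch does not decode multi-byte sequences.
bool startsWithAscii(std::u16string_view text, Utf8View prefix)
{
    if (prefix.bytes.size() > text.size())
        return false;
    for (std::size_t i = 0; i < prefix.bytes.size(); ++i) {
        const auto byte = static_cast<unsigned char>(prefix.bytes[i]);
        if (byte >= 0x80 || text[i] != static_cast<char16_t>(byte))
            return false;
    }
    return true;
}
```

A call such as startsWithAscii(u"hello world", Utf8View{"hello"}) then
needs only the 8-bit literal on the caller's side, and the tag type keeps
the "this is UTF-8" knowledge attached to the data.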

	Eddy.