[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
Marc Mutz
marc.mutz at kdab.com
Thu May 14 16:41:45 CEST 2020
Hi Lars,
On 2020-05-12 09:49, Lars Knoll wrote:
[...]
> One open question is whether we should add a QUtf8String with a
> char8_t. I am not yet convinced that we actually need the class
> though.
[...]
I positively want to stop using QByteArray as the QUtf8String that it
currently is. QByteArray should lose all notion of string-ness
(deprecate toLower() etc, remove in Qt 7) and be a QVector<std::byte>.
Not sure we'll get there for Qt 6, not sure we'll get there with the
name QByteArray, but that should be the end game for this class.
The networking code is full of uses of QByteArray and, due to the lack of
QByteArrayRef (QStringRef) or QByteArrayView (QStringView), its
splitting and substringing is much less performant than it could be.
Also, given a function like
setFoo(const QByteArray &);
what does this actually expect? A UTF-8 string? A local 8-bit string?
An octet stream? A Latin-1 string? QByteArray is the jack of all these,
master of none.
So, assuming the premise that QByteArray should not be string-ish
anymore, what do we want to have as the result type of QString::toUtf8()
and QString::toLatin1()? Do we really want mere bytes?
I don't think so.
If Unicode succeeds, most I/O will be in the form of UTF-8. File names
on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
(as they are on Windows). It makes a _ton_ of sense to have a container
for this, and C++20 tempts us with char8_t to do exactly that. I'd love
to do string processing in UTF-8 without potentially doubling the
storage requirements by first converting it to UTF-16, then doing the
processing, then converting it back.
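To put a number on the storage argument, here is a minimal standard-C++ sketch (no Qt; `utf8Bytes` and `utf16Bytes` are hypothetical helper names, not Qt API). It compares the payload size of the same text held as UTF-8 and as UTF-16 code units. The factor of two holds for ASCII-only text such as typical Unix paths; text outside ASCII needs 2 to 4 bytes per code point in UTF-8, which is why the doubling is only "potential".

```cpp
#include <cstddef>
#include <string>

// Payload size in bytes when the text is stored as 8-bit code units
// (std::string stands in for a UTF-8 container here).
inline std::size_t utf8Bytes(const std::string &s)
{
    return s.size() * sizeof(char);
}

// Payload size in bytes when the same text is stored as UTF-16 code units.
inline std::size_t utf16Bytes(const std::u16string &s)
{
    return s.size() * sizeof(char16_t);
}
```

For an ASCII-only path, `utf16Bytes(u"/usr/share/doc")` is `sizeof(char16_t)` times `utf8Bytes("/usr/share/doc")`, i.e. twice as large on the usual platforms where `char16_t` is two bytes.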
Qt should have a strong story not just for UTF-16, but also for UTF-8.
I've talked about this at QtWS, but here's the TL;DV of it:
    value_type              container       view                string-ish API?
    char     / QLatin1Char  QLatin1String   QLatin1StringView   yes
    char8_t  / qchar8       QUtf8String     QUtf8StringView     yes
    char16_t / QChar        QString         QStringView         yes
    (char32_t               QUtf32String    QUtf32StringView    yes)
    std::byte               QByteArray      QByteArrayView      NO
I'm not sure we need the UTF-32 one, and I'm OK with dropping the L1 one,
provided a) we can depend on char8_t (i.e. Qt 7) and b) UTF-8 <-> UTF-16
operations are not much slower than L1 <-> UTF-16 ones (I heard Lars'
team has them down to within 5% of each other, not sure that's
possible). Anyway, we'd have two class templates, and they'd just be
instantiated with different Char types to flesh out all of the above,
with the exception of the byte array ones:
    using QUtf8String = QBasicString<char8_t>;
    using QString = QBasicString<char16_t>;
    using QLatin1String = QBasicString<char>;
    (using QByteArray = QVector<std::byte>;)
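As a rough illustration of the "one class template, several aliases" idea, here is a sketch that uses `std::basic_string` as a stand-in for the proposed `QBasicString` (which does not exist in Qt today). The names `BasicString`, `Latin1String`, `Utf16String`, `Utf8String` and `countUnit` are all hypothetical. The `char8_t` alias is guarded because that type needs C++20.

```cpp
#include <cstddef>
#include <string>

// Stand-in for the proposed QBasicString<Char>: one template, several aliases.
template <typename Char>
using BasicString = std::basic_string<Char>;

using Latin1String = BasicString<char>;      // analogue of QLatin1String
using Utf16String  = BasicString<char16_t>;  // analogue of QString
#if defined(__cpp_char8_t)
using Utf8String   = BasicString<char8_t>;   // analogue of QUtf8String (C++20)
#endif

// Written once, usable with every instantiation:
// counts how often a given code unit occurs in the string.
template <typename Char>
std::size_t countUnit(const BasicString<Char> &s, Char unit)
{
    std::size_t n = 0;
    for (Char c : s) {
        if (c == unit)
            ++n;
    }
    return n;
}
```

Both `countUnit(Latin1String("a,b,c"), ',')` and `countUnit(Utf16String(u"a,b,c"), u',')` instantiate the same function template, which is the point of deriving all string classes from one template.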
If, after getting all of the above running, we _then_ want The One String
(View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
need a QAnyString), which can contain any of the 2-4 string (view)
classes above (but not QByteArray(View)), but which doesn't have
string-ish API. Instead, you need to inspect it to extract the actual
string class (QLatin1String, QUtf8String, QString) contained, or simply
ask for the one you want, and it will convert, if necessary.
With this, your typical Qt function taking strings would look like this:
    void QLineEdit::setText(QAnyStringView text)
    {
        Q_D(QLineEdit);
        if (text == d->text) // mixed-mode comparisons are supported out of the box
            return;
        // centralized conversion to QString (in library, not user code);
        // also available: toLatin1(), toUtf8()
        d->text = text.toString();
        update();
    }
Callers now have total freedom in what to pass:
    le->setText("Hi");
    le->setText(u"Hi");
    le->setText(u8"Hi");
    le->setText(u"Hi"s);
    le->setText(u8"Hi"sv);
    le->setText(QVarLengthArray{'H', 'i'});
    le->setText("Hello" % ", World"); // QStringBuilder
and they'd all result in optimal code, because QAnyStringView is a
trivial type (in the C++ sense), which means, unlike QString, it can be
passed in CPU registers instead of on the stack.
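To show what makes such a view cheap to pass, here is a hypothetical layout sketch (`AnyStringViewSketch` is an invented name, and the two-bit tag packed into the size word is an assumption for illustration, not a description of any actual Qt class). The type owns nothing, so it is trivially copyable and trivially destructible, and it fits into two machine words. Whether a given compiler then really passes it in registers depends on the platform ABI.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical non-owning view: a pointer plus a size word whose low two
// bits record the encoding of the data the pointer refers to.
class AnyStringViewSketch {
public:
    enum class Encoding : std::size_t { Latin1 = 0, Utf8 = 1, Utf16 = 2 };

    constexpr AnyStringViewSketch(const void *data, std::size_t size, Encoding e)
        : m_data(data), m_sizeAndTag((size << 2) | static_cast<std::size_t>(e)) {}

    constexpr const void *data() const { return m_data; }
    // Length in code units, with the tag bits shifted out.
    constexpr std::size_t size() const { return m_sizeAndTag >> 2; }
    constexpr Encoding encoding() const
    { return static_cast<Encoding>(m_sizeAndTag & 3); }

private:
    const void *m_data;
    std::size_t m_sizeAndTag;
};

// No ownership means no user-written copy operations or destructor.
static_assert(std::is_trivially_copyable_v<AnyStringViewSketch>,
              "view must be trivially copyable");
static_assert(std::is_trivially_destructible_v<AnyStringViewSketch>,
              "view must be trivially destructible");
// An owning string, by contrast, is not trivially copyable.
static_assert(!std::is_trivially_copyable_v<std::string>,
              "owning strings are not trivially copyable");
```

Packing the tag into the size word keeps the object at two words; a separate tag member would add padding and grow it to three words on typical 64-bit platforms.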
Likewise, parsing code could do
    Meep parseMeep(QAnyStringView str)
    {
        return str.visit([](auto str) {
            Meep meep;
            for (auto me : str.tokenize(u'\n'))
                meep += parse(me);
            return meep;
        });
    }
iow: instead of a bunch of overloads, you write your code as a template
and let QAnyStringView instantiate your lambda with the actual type of
string view passed.
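The visit-based dispatch can be tried out in standard C++ with `std::variant`. The following is a sketch under assumptions: `AnyView` and `countLines` are hypothetical names, only two view types are modelled, and `std::visit` stands in for whatever mechanism a real QAnyStringView would use.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for QAnyStringView: holds exactly one of two views.
class AnyView {
public:
    AnyView(std::string_view s) : m_v(s) {}
    AnyView(std::u16string_view s) : m_v(s) {}

    // Calls f with the concrete view type that is actually stored.
    template <typename F>
    decltype(auto) visit(F &&f) const
    {
        return std::visit(std::forward<F>(f), m_v);
    }

private:
    std::variant<std::string_view, std::u16string_view> m_v;
};

// One generic lambda replaces a set of per-encoding overloads.
// Counts lines: 0 for empty text, otherwise number of '\n' plus one.
inline std::size_t countLines(AnyView text)
{
    return text.visit([](auto view) -> std::size_t {
        using Char = typename decltype(view)::value_type;
        if (view.empty())
            return 0;
        std::size_t lines = 1;
        for (Char c : view) {
            if (c == Char('\n'))
                ++lines;
        }
        return lines;
    });
}
```

The lambda body is compiled once per stored view type, so `countLines(std::string_view("a\nb"))` and `countLines(std::u16string_view(u"a\nb"))` each run code specialised for their own character type.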
As a further example, here's op== for QAnyStringView (provided by Qt):
    bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
    {
        return lhs.visit([rhs](auto lhs) {
            return rhs.visit([lhs](auto rhs) {
                return lhs == rhs;
            });
        });
    }
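The same double dispatch can be sketched with two nested `std::visit` calls. This is an illustration only: `AnyView`, `toUtf16Unit`, `equalUnits` and `sameText` are hypothetical names, and the sketch assumes that the 8-bit view holds Latin-1, so that every byte maps to exactly one UTF-16 code unit. A UTF-8 alternative would need real decoding, which is omitted here. A named function is used instead of `operator==` because `std::variant` already defines its own.

```cpp
#include <algorithm>
#include <string_view>
#include <variant>

using AnyView = std::variant<std::string_view, std::u16string_view>;

// ASSUMPTION: 8-bit data is Latin-1, so each byte widens to one UTF-16 unit.
constexpr char16_t toUtf16Unit(char c) { return static_cast<unsigned char>(c); }
constexpr char16_t toUtf16Unit(char16_t c) { return c; }

// Mixed-mode comparison: same length and pairwise equal after widening.
template <typename L, typename R>
bool equalUnits(L lhs, R rhs)
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                      [](auto a, auto b) {
                          return toUtf16Unit(a) == toUtf16Unit(b);
                      });
}

// The outer visit resolves the left-hand type, the inner one the right-hand
// type, so equalUnits is instantiated for every combination of view types.
inline bool sameText(const AnyView &lhs, const AnyView &rhs)
{
    return std::visit([&rhs](auto l) {
        return std::visit([l](auto r) { return equalUnits(l, r); }, rhs);
    }, lhs);
}
```

With two alternatives this yields four instantiations of `equalUnits`, which is what lets a single comparison entry point cover all mixed-mode cases.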
Last year, I heard someone (don't remember whom) suggest something like
this for QString itself, that is: allow QString to hold either UTF-16 or
UTF-8 data. I'd
classify this idea as another over-my-dead-body (which, btw, is
semi-official ISO speak for "strong objection"). As I'm wont to say: An
API doesn't become easy to use by minimizing the number of classes, but
by minimizing the number of responsibilities per class, even if that
means many more small classes than one big.
I would add, as I've done before, and even Matthew said, that I'd be
very wary of folding QStringView into QString. I can understand the urge
to not have to go and s/QString/QStringView/ in many places (or
s/QString/QAnyStringView/), but it is my firm belief that it would make
Qt much easier and more convenient to use if we didn't put all those
responsibilities on QString.
There's only our own laziness which stands in the way of this better
alternative.
Thanks,
Marc