[Development] HEADS-UP: QStringLiteral
marc at kdab.com
Thu Aug 22 14:47:04 CEST 2019
On 2019-08-22 13:42, Lars Knoll wrote:
>> That's why we are not removing QLatin1String: the Latin1 algorithm is
>> as fast as memcpy. The only thing better than that is zero copies.
> We could also turn this around: Are we over-optimising here? Do we
> have the right balance between ease of use and performance? Converting
> utf8 is a bit more costly than latin1, but would that ever matter in
> real world use cases?
Once we have proper support for u8 (in Qt, and C++ (char8_t)), we can
certainly think about phasing out QLatin1String. Personally, I don't
think the decoding performance between L1 and UTF-8 is the key here.
UTF-8 even has the nice property that it's closed under all text
transformations in all locales, unlike L1 (toupper('ß') == ẞ ∉ L1,
tolower('I') @ tr_TR = ı ∉ L1, ...). QUtfXXX would also greatly reduce
the number of overloads of core string functions we need to provide (the
same way as QStringView does already, if you consider
QT_STRINGVIEW_LEVEL >= 2).
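
To make the closure point concrete, here is a minimal sketch in plain C++ (no Qt). fitsLatin1() and encodeUtf8() are hypothetical helpers, not existing API, and the code points are the case-mapping results named above, taken as given rather than computed:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if the code point fits in one Latin-1 byte.
constexpr bool fitsLatin1(char32_t cp) { return cp <= 0xFF; }

// Hypothetical helper: encode one Unicode scalar value as UTF-8 (1 to 4 bytes).
inline std::string encodeUtf8(char32_t cp)
{
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

constexpr char32_t sharpS        = 0x00DF; // ß, representable in Latin-1
constexpr char32_t capitalSharpS = 0x1E9E; // ẞ, the uppercase form named above
constexpr char32_t dotlessI      = 0x0131; // ı, tolower('I') in tr_TR

static_assert(fitsLatin1(sharpS), "input is Latin-1");
static_assert(!fitsLatin1(capitalSharpS), "result leaves Latin-1");
static_assert(!fitsLatin1(dotlessI), "result leaves Latin-1");
```

Every one of these code points still has a UTF-8 encoding, which is all "closed" means here: the result of the transformation can always be stored in the same string type.
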
For me, the problem is QUtf8XXX::size() - what should that return?! IOW:
what's the meaning of an index into a UTF-8 string? That extends to
mid(), left(), right(), split(), ... In all current Qt string classes,
size() returns the number of characters (ignoring surrogate pairs in
QString, which we probably can live with because there are different
ways to spell a ä in Unicode, too (ä, a + ¨), such that any serious text
processing is anyway far removed from the simplistic 1 code point = 1
glyph pov, so surrogate pairs aren't much of an issue anymore). Whatever
we do here, it will be downhill from where we are. Either size() is O(N),
or a string (view) has to cache the count and is no longer the size of a
pointer (or two). That's 2x the fixed per-object memory for a
pointer-sized string, or +50% for a two-pointer view, and such stuff adds
up over 1000s of strings...
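
A rough sketch of that trade-off in plain C++, using std::string_view as a stand-in for a UTF-8 view. countCodePoints() and CountedUtf8View are hypothetical names for illustration, and the input is assumed to be valid UTF-8:

```cpp
#include <cstddef>
#include <string_view>

// Option 1: compute the character count on demand. This is O(N): every
// byte is inspected, and continuation bytes (10xxxxxx) are skipped.
constexpr std::size_t countCodePoints(std::string_view utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Option 2: cache the count next to the data. size() becomes O(1), but
// the view grows by one extra word.
struct CountedUtf8View {
    std::string_view bytes;
    std::size_t codePoints;
};

static_assert(sizeof(CountedUtf8View) > sizeof(std::string_view),
              "caching the count makes the view bigger");
```

For "äb" encoded as UTF-8, bytes.size() is 3 while countCodePoints() gives 2, so the two notions of size() already disagree on a two-character string.
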
So, maybe, at some point in the future, we can axe QLatin1String. But we
need to seriously up UTF-8 support in Qt before that. QString is kind of
in the way here, as UTF-16 has the bad side effect of endian dependence.
If, say, .qm files were stored in UTF-8, tr() could return a QUtf8View.
That's not possible with QString, unless apps come with two .qm files,
one LE and one BE.
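
To illustrate the endian dependence, here is a small sketch in plain C++. toUtf16LE() and toUtf16BE() are hypothetical helpers that serialize a single UTF-16 code unit the way a file in each byte order would store it:

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: one UTF-16 code unit as stored in a little-endian file.
constexpr std::array<std::uint8_t, 2> toUtf16LE(char16_t u)
{
    return {{ static_cast<std::uint8_t>(u & 0xFF),
              static_cast<std::uint8_t>(u >> 8) }};
}

// Hypothetical helper: the same code unit as stored in a big-endian file.
constexpr std::array<std::uint8_t, 2> toUtf16BE(char16_t u)
{
    return {{ static_cast<std::uint8_t>(u >> 8),
              static_cast<std::uint8_t>(u & 0xFF) }};
}

// U+00E4 (ä) is E4 00 in UTF-16LE but 00 E4 in UTF-16BE, so a memory-mapped
// file matches the in-memory layout on only one kind of host. In UTF-8 the
// same character is always the byte sequence C3 A4, in that order.
```

This is why returning a view straight into the file data only works when the file's byte order happens to match the host, whereas a UTF-8 file has a single byte layout everywhere.
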
One way to get out of this history pit was mentioned here and there on
this ML before: we could have a QAnyString(View) (all names subject to
bikeshedding), a string (view) that type-erases the encoding (like a
std::variant<QUtf8String(View), QLatin1String(View), QString(View)>),
which would be the type used in higher-level APIs
(QLineEdit::setText(QAnyStringView)). I think std::filesystem::path got
that quite right: you can feed it UTF-8 or UTF-16, and it will
transparently convert to and from native API's encoding as needed.
But such a type has to be an _addition_ to, not a replacement of,
encoding-dependent string types (proof: how do you process a
QAnyString(View) if you're given one? Probably, keeping the std::variant
simile, with a visitation mechanism, and the visitor is overloaded on
the type. Sure, you can use (char8_t*, qsizetype) and (char16_t*,
qsizetype) for that, but then we're back to a place we thought we'd never
go back to after we got views: C-like string manipulation APIs).
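
A minimal sketch of the visitation idea in plain C++17, keeping the std::variant simile. Latin1View, Utf8View, Utf16View, AnyStringView, Overloaded and utf16Length() are all hypothetical stand-ins, not Qt types, and the UTF-8 branch assumes valid input:

```cpp
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical encoding-specific views (stand-ins for the Qt ones).
struct Latin1View { std::string_view bytes; };
struct Utf8View   { std::string_view bytes; };
struct Utf16View  { std::u16string_view units; };

// The type-erased view: it holds exactly one of the above.
using AnyStringView = std::variant<Utf8View, Latin1View, Utf16View>;

// Builds one visitor out of several lambdas, overloaded on the view type.
template <class... Fs> struct Overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> Overloaded(Fs...) -> Overloaded<Fs...>;

// Number of UTF-16 code units the text would occupy after conversion.
// Each encoding gets its own typed branch; no raw (pointer, size) pairs.
inline std::size_t utf16Length(const AnyStringView &s)
{
    return std::visit(Overloaded{
        [](Latin1View v) { return v.bytes.size(); },  // 1 byte -> 1 unit
        [](Utf16View v)  { return v.units.size(); },  // already UTF-16
        [](Utf8View v) {
            std::size_t n = 0;
            for (unsigned char c : v.bytes) {
                if ((c & 0xC0) != 0x80)
                    ++n;            // ASCII or lead byte starts a code point
                if (c >= 0xF0)
                    ++n;            // 4-byte sequence needs a surrogate pair
            }
            return n;
        },
    }, s);
}
```

The caller of a higher-level API passes whichever view it has, and only the implementation needs the encoding-specific types, which is why those types have to keep existing alongside the type-erased one.
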