[Development] HEADS-UP: QStringLiteral

Thu Aug 22 14:47:04 CEST 2019

On 2019-08-22 13:42, Lars Knoll wrote:
>> That's why we are not removing QLatin1String: the Latin1 algorithm is
>> as fast as memcpy. The only thing better than that is zero copies.
> 
> We could also turn this around: Are we over-optimising here? Do we
> have the right balance between ease of use and performance? Converting
> utf8 is a bit more costly than latin1, but would that ever matter in
> real world use cases?

Once we have proper support for u8 (in Qt, and C++ (char8_t)), we can 
certainly think about phasing out QLatin1String. Personally, I don't
think the difference in decoding performance between L1 and UTF-8 is
the key here.

UTF-8 even has the nice property that it's closed under all text 
transformations in all locales, unlike L1 (toupper('ß') == ẞ ∉ L1, 
tolower('I') @ tr_TR = ı ∉ L1, ...). QUtf8XXX would also greatly reduce
the number of overloads of core string functions we need to provide (the 
same way as QStringView does already, if you consider 
QT_STRINGVIEW_LEVEL >= 2).
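
To make the closure argument concrete, here is a minimal sketch in plain
standard C++ (no Qt). fitsLatin1() and encodeUtf8() are made-up helper
names, and the code points are the two case-mapping results named above,
hard-coded rather than computed by any locale machinery:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: Latin-1 covers exactly U+0000..U+00FF.
constexpr bool fitsLatin1(char32_t cp) { return cp <= 0xFF; }

// Hypothetical helper: encode one Unicode scalar value as UTF-8 (1-4 bytes).
std::string encodeUtf8(char32_t cp)
{
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

constexpr char32_t SharpS        = 0x00DF; // ß, representable in Latin-1
constexpr char32_t CapitalSharpS = 0x1E9E; // ẞ, not representable in Latin-1
constexpr char32_t DotlessI      = 0x0131; // ı, not representable in Latin-1
```

fitsLatin1() is true for SharpS but false for CapitalSharpS and
DotlessI, whereas encodeUtf8() accepts all three: the result of the
transformation always has a UTF-8 spelling, just possibly a longer one.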

For me, the problem is QUtf8XXX::size() - what should that return?! IOW: 
what's the meaning of an index into a UTF-8 string? That extends to 
mid(), left(), right(), split(), ... In all current Qt string classes, 
size() returns the number of characters (ignoring surrogate pairs in 
QString, which we probably can live with because there are different 
ways to spell an ä in Unicode, too (ä, a + ¨), such that any serious text
processing is anyway far removed from the simplistic 1 code point = 1 
glyph pov, so surrogate pairs aren't much of an issue anymore). Whatever 
we do here, it will be downhill from where we are. Either size() is O(N) 
or a string (view) is no longer the size of a pointer (or two). That's
2x (or, for a view, 50% more) O(1) memory per string (view), and such
stuff adds up over 1000s of strings...
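
A rough sketch of that trade-off, in plain standard C++ with
std::string_view standing in for a would-be QUtf8View. All names here
are hypothetical, and the counting loop assumes well-formed UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Option A: size() means code units (bytes). O(1), but an index no
// longer identifies a character.
std::size_t sizeInCodeUnits(std::string_view utf8) { return utf8.size(); }

// Option B: size() means code points. O(N): count every byte that is
// not a continuation byte (10xxxxxx). Assumes well-formed UTF-8.
std::size_t sizeInCodePoints(std::string_view utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Option C: cache the code point count. O(1) again, but the view grows
// from two words to three.
struct CountedUtf8View {
    const char *data;
    std::size_t bytes;
    std::size_t codePoints;
};

// Usage: the two spellings of ä from the text above.
inline void sizeDemo()
{
    const std::string_view precomposed = "\xC3\xA4"; // ä (U+00E4)
    const std::string_view decomposed = "a\xCC\x88"; // a + U+0308
    assert(sizeInCodeUnits(precomposed) == 2);
    assert(sizeInCodePoints(precomposed) == 1);
    assert(sizeInCodeUnits(decomposed) == 3);
    assert(sizeInCodePoints(decomposed) == 2);
}
```

Note that even the code point count disagrees between the two spellings
of ä, which is the "1 code point = 1 glyph" fallacy in miniature.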

So, maybe, at some point in the future, we can axe QLatin1String. But we 
need to seriously up UTF-8 support in Qt before that. QString is kind of 
in the way here, as UTF-16 has the bad side effect of endian dependence. 
If, say, .qm files were stored in UTF-8, tr() could return a QUtf8View. 
That's not possible with QString, unless apps come with two .qm files, 
one LE and one BE.
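
To illustrate the endian problem, here is a small sketch in plain
standard C++. It has nothing to do with the actual .qm format; the
function names are made up for the example:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Serialize UTF-16 code units low byte first (little-endian layout).
Bytes toUtf16LE(const std::u16string &s)
{
    Bytes out;
    for (char16_t u : s) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));
        out.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return out;
}

// Serialize UTF-16 code units high byte first (big-endian layout).
Bytes toUtf16BE(const std::u16string &s)
{
    Bytes out;
    for (char16_t u : s) {
        out.push_back(static_cast<std::uint8_t>(u >> 8));
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));
    }
    return out;
}

// UTF-8 is a byte sequence already: there is only one layout.
Bytes toUtf8Bytes(const std::string &s) { return Bytes(s.begin(), s.end()); }

// Usage: the same text, ä (U+00E4), in all three layouts.
inline void endianDemo()
{
    const Bytes le = toUtf16LE(u"\u00E4");
    const Bytes be = toUtf16BE(u"\u00E4");
    const Bytes u8 = toUtf8Bytes("\xC3\xA4");
    assert(le != be);      // the on-disk bytes depend on the byte order
    assert(u8.size() == 2); // one fixed byte sequence on every host
}
```

Only one of the two UTF-16 layouts can be used in place as char16_t
data on a given host; the other needs a byte-swapping copy. The UTF-8
bytes can be viewed as-is everywhere.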

One way to get out of this history pit was mentioned here and there on 
this ML before: we could have a QAnyString(View) (all names subject to 
bikeshedding), a string (view) that type-erases the encoding (like a 
std::variant<QUtf8String(View), QLatin1String(View), QString(View)>), 
which would be the type used in higher-level APIs 
(QLineEdit::setText(QAnyStringView)). I think std::filesystem::path got 
that quite right: you can feed it UTF-8 or UTF-16, and it will
transparently convert to and from the native API's encoding as needed.

But such a type has to be an _addition_ to, not a replacement of,
encoding-dependent string types. Proof: how do you process a
QAnyString(View) if you're given one? Probably, keeping the std::variant
simile, with a visitation mechanism, where the visitor is overloaded on
the type. Sure, you can use (char8_t*, qsizetype) and (char16_t*,
qsizetype) for that, but then we're back to a place we thought we'd never
go back to after we got views: C-like string manipulation APIs.
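
A minimal sketch of the std::variant simile, in plain standard C++17.
Utf8View, Latin1View, Utf16View and AnyStringView are hypothetical
stand-ins, not Qt classes, and the UTF-8 decoder assumes well-formed
input:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical encoding-specific views.
struct Utf8View   { std::string_view s; };
struct Latin1View { std::string_view s; };
struct Utf16View  { std::u16string_view s; };

// The type-erased view a higher-level API would take.
using AnyStringView = std::variant<Utf8View, Latin1View, Utf16View>;

// The usual helper for building a visitor out of lambdas.
template <class... Fs> struct Overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> Overloaded(Fs...) -> Overloaded<Fs...>;

// Decode well-formed UTF-8 into UTF-16 (no error handling).
std::u16string utf8ToUtf16(std::string_view in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= in.size()); // truncated sequences are not handled
        char32_t cp = len == 1 ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else { // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}

// A consumer: the visitor is overloaded on the encoding-specific type.
std::u16string toUtf16(AnyStringView v)
{
    return std::visit(Overloaded{
        [](Utf8View u) { return utf8ToUtf16(u.s); },
        [](Latin1View l) {
            std::u16string out; // every Latin-1 byte is its own code point
            for (unsigned char c : l.s)
                out += static_cast<char16_t>(c);
            return out;
        },
        [](Utf16View u) { return std::u16string(u.s); }
    }, v);
}
```

A caller can pass Latin1View{"\xE4"}, Utf8View{"\xC3\xA4"} or
Utf16View{u"\u00E4"} to toUtf16(), but each branch of the visitor still
needs its own encoding-specific view type to do the actual work.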

Flame away...

Thanks,
Marc