[Development] Qt6: Adding UTF-8 storage support to QString
Olivier Goffart
olivier at woboq.com
Thu Jan 24 08:32:28 CET 2019
On 23.01.19 23:15, André Pönitz wrote:
> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>> 23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
>>> All of this discussion ignores a major elephant: QString's indexing is
>>> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
>>> for a couple of decades now.
>>>
>>> We *should* have a string type (I don't care what you call it) that acts
>>> on strings indexed by Unicode characters, not in terms of a
>>> representation. Whether that string type internally uses UTF-16 or
>>> UTF-8 should be invisible to its user. Ideally it would be capable of
>>> carrying its data internally in either form (so as to avoid needless
>>> conversion when both producer and consumer use the same form) and of
>>> converting between the two (e.g. so as to append efficiently) as needed.
>>
>> I think this is excessive. Most common operations with strings in application
>> code are:
>>
>> * Pass the string around or compare as an opaque token
>> * Draw the string on screen e.g. with QPainter (while technically it
>> falls in the previous category, I think it's important enough to
>> deserve separate item)
>> * Find substring or pattern (regex) inside the string
>> * Split the string by character, pattern, or index boundaries found by means
>> of previous item
>>
>> I think the only common cases when dealing with Unicode grapheme clusters
>> is required are
>>
>> * Handling of text cursor movement
>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>
>> I think having special iterator would be quite enough for cursor case. Such
>> iterator could abstract away underlying encoding, instead of forcing everyone
>> to convert to UTF-16 first.
>
> All of that is scarily close to my opinion on the topic.
Same here. I think Konstantin is spot on.

Another example of good string design, I think, is Rust's String. It is
always valid UTF-8 and is indexed by bytes, and slicing it in the middle of a
code point is a programmer error (the slice operation panics).
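
To make that boundary rule concrete, here is a minimal sketch in plain C++ (no Qt). The names isCharBoundary and sliceUtf8 are hypothetical and only mimic the idea; the sketch assumes the input is already valid UTF-8 and throws where Rust would panic.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// True if byte offset `i` starts a code point (or is the end of the string).
// Assumes `s` holds valid UTF-8, where continuation bytes look like 0b10xxxxxx.
bool isCharBoundary(const std::string &s, std::size_t i)
{
    if (i == s.size())
        return true;
    if (i > s.size())
        return false;
    return (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
}

// Byte-indexed slice [begin, end) that refuses to split a code point.
std::string sliceUtf8(const std::string &s, std::size_t begin, std::size_t end)
{
    if (begin > end || !isCharBoundary(s, begin) || !isCharBoundary(s, end))
        throw std::out_of_range("slice does not fall on a code point boundary");
    return s.substr(begin, end - begin);
}
```

With the bytes of "héllo" (h, 0xC3, 0xA9, l, l, o), sliceUtf8(s, 0, 3) returns "hé", while sliceUtf8(s, 0, 2) throws because offset 2 is in the middle of "é".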

As already mentioned, UTF-16 would be quite a bad choice if it weren't for
legacy.

The argument that developers wrongly using indexes causes more problems with
UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
means that developers will see and fix their bugs quickly.
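
A small sketch of why the bug surfaces earlier with UTF-8, using std::string and std::u16string as stand-ins for the two storage forms (the helper names are hypothetical, and both functions assume well-formed input):

```cpp
#include <cstddef>
#include <string>

// Code points in valid UTF-8: count every byte that is not a continuation byte.
std::size_t codePointCountUtf8(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}

// Code points in valid UTF-16: count every unit that is not a low surrogate.
std::size_t codePointCountUtf16(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t unit : s)
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++n;
    return n;
}
```

For "é" (U+00E9), size() is 2 in UTF-8 but 1 in UTF-16, so code that assumes one unit per character already misbehaves on ordinary accented text in UTF-8. In UTF-16 the same assumption only breaks outside the BMP, for example on U+1F600, where size() is 2 (and 4 in UTF-8).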

I understand that changing QString to UTF-8 is a difficult task if we want to
do it in a compatible way. However, I think there is a way:

In Qt5.x:
 - Introduce an iterator that iterates over Unicode code points.
 - Deprecate utf16() and other API that assumes that QString is UTF-16.
 - Replace them with a toUtf16() which returns a QVector<ushort>. I believe
   that it is possible to make the content implicitly shared with the
   QString, avoiding copies (since it is just a QTypedArrayData internally).

Then in Qt6 one can simply change the representation without breaking
compatibility with non-deprecated functions.
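
As a rough illustration of the first step, the sketch below decodes code points from either storage form, so calling code never sees the encoding. It uses standard C++ types rather than QString, the overloaded codePoints() name is hypothetical, and a real iterator would decode lazily instead of building a vector. The UTF-8 overload assumes valid input; unpaired surrogates in UTF-16 are passed through unchanged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-16, combining surrogate pairs into a single code point.
std::vector<char32_t> codePoints(const std::u16string &s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        const bool high = c >= 0xD800 && c <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i; // skip the low surrogate that was just consumed
        }
        out.push_back(c);
    }
    return out;
}

// Decode UTF-8 (assumed valid): the lead byte gives the sequence length.
std::vector<char32_t> codePoints(const std::string &s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const std::size_t extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : lead >= 0xC0 ? 1 : 0;
        // Keep only the payload bits of the lead byte (all 7 bits for ASCII).
        char32_t c = extra == 0 ? lead : lead & (0x3F >> extra);
        for (std::size_t k = 1; k <= extra && i + k < s.size(); ++k)
            c = (c << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(c);
        i += extra + 1;
    }
    return out;
}
```

Both overloads yield the same sequence (U+0061, U+00E9, U+1F600) for "aé😀", whichever encoding the text is stored in, which is the property that would let the internal representation change later.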
--
Olivier
Woboq - Qt services and support - https://woboq.com - https://code.woboq.org