[Development] Qt6: Adding UTF-8 storage support to QString

Thu Jan 24 08:32:28 CET 2019

On 23.01.19 23:15, André Pönitz wrote:
> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>> 23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
>>> All of this discussion ignores a major elephant: QString's indexing is
>>> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
>>> for a couple of decades now.
>>>
>>> We *should* have a string type (I don't care what you call it) that acts
>>> on strings indexed by Unicode characters, not in terms of a
>>> representation. Whether that string type internally uses UTF-16 or
>>> UTF-8 should be invisible to its user. Ideally it would be capable of
>>> carrying its data internally in either form (so as to avoid needless
>>> conversion when both producer and consumer use the same form) and of
>>> converting between the two (e.g. so as to append efficiently) as needed.
>>
>> I think this is excessive. Most common operations with strings in application
>> code are:
>>
>> * Pass the string around or compare as an opaque token
>> * Draw the string on screen e.g. with QPainter (while technically it
>>    falls in the previous category, I think it's important enough to
>>    deserve separate item)
>> * Find substring or pattern (regex) inside the string
>> * Split the string by character, pattern, or index boundaries found by means
>>    of previous item
>>
>> I think the only common cases when dealing with Unicode grapheme clusters
>> is required are
>>
>> * Handling of text cursor movement
>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>
>> I think having special iterator would be quite enough for cursor case. Such
>> iterator could abstract away underlying encoding, instead of forcing everyone
>> to convert to UTF-16 first.
> 
> All of that is scarily close to my opinion on the topic.

Same here. I think Konstantin is spot on.

Another example of good string design, I think, is Rust's String. It is always
valid UTF-8 and is indexed by bytes, and slicing it in the middle of a code
point is a programmer error (it panics at run time).
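
To make the "indexed by bytes, but never split a code point" idea concrete,
here is a rough sketch in plain C++17 (no Qt). The names isCharBoundary and
checkedPrefix are made up for illustration; they are not Rust or Qt API, and
the code assumes the input is already valid UTF-8. Where Rust would panic,
this sketch returns std::nullopt.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// True if byte offset `i` does not fall inside a multi-byte UTF-8 sequence.
// Assumption: `s` already holds valid UTF-8.
inline bool isCharBoundary(std::string_view s, std::size_t i)
{
    if (i == 0 || i == s.size())
        return true;
    if (i > s.size())
        return false;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
}

// Byte-indexed prefix that refuses to cut a code point in half.
// (Hypothetical helper; Rust panics here, this returns std::nullopt.)
inline std::optional<std::string_view> checkedPrefix(std::string_view s,
                                                     std::size_t end)
{
    if (!isCharBoundary(s, end))
        return std::nullopt;
    return s.substr(0, end);
}

inline void boundaryExample()
{
    // 'h' is 1 byte; U+00E9 is 2 bytes in UTF-8 (0xC3 0xA9).
    const std::string_view s = "h\xC3\xA9";
    assert(s.size() == 3);
    assert(checkedPrefix(s, 1).has_value());   // right after 'h'
    assert(!checkedPrefix(s, 2).has_value());  // inside U+00E9
    assert(checkedPrefix(s, 3).has_value());   // end of string
}
```

The check is O(1) per index, which is why byte indexing stays cheap even
though the string is guaranteed never to be cut mid-character.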

As already mentioned, UTF-16 would be quite a bad choice if it weren't for
legacy.

The argument that developers who wrongly use indexes cause more problems with
UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
means that those developers will see and fix their bugs quickly.
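
A small illustration of that point, in plain C++ (not Qt): code that assumes
"one index == one character" already breaks for U+00E9 in UTF-8, while in
UTF-16 it only breaks once something outside the BMP shows up. The function
name codeUnitExample is just for this sketch.

```cpp
#include <cassert>
#include <string>

inline void codeUnitExample()
{
    // U+00E9 (LATIN SMALL LETTER E WITH ACUTE), inside the BMP.
    const std::string    e8  = "\xC3\xA9";   // UTF-8 bytes
    const std::u16string e16 = u"\u00E9";    // UTF-16 code units

    // U+1F600 (an emoji), outside the BMP.
    const std::string    g8  = "\xF0\x9F\x98\x80";
    const std::u16string g16 = u"\U0001F600";

    // In UTF-8 the index/character mismatch shows up immediately...
    assert(e8.size() == 2);
    assert(g8.size() == 4);

    // ...in UTF-16 it stays hidden until a surrogate pair is involved.
    assert(e16.size() == 1);
    assert(g16.size() == 2);
}
```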

I understand changing QString to UTF-8 is a difficult task if we want to do it 
in a compatible way. However, I think there is a way:
In Qt 5.x:
  - Introduce an iterator that iterates over Unicode code points.
  - Deprecate utf16() and other APIs that assume QString is UTF-16.
  - Replace them with a toUtf16() that returns a QVector<ushort>. I believe it
    is possible to make the content implicitly shared with the QString,
    avoiding copies (since it is just a QTypedArrayData internally).
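
As a rough idea of what the code point iteration in the first item could do
on top of today's UTF-16 storage, here is a sketch in plain C++17 using
std::u16string_view instead of QString. nextCodePoint and codePoints are
hypothetical names, not proposed Qt API, and reporting lone surrogates as
U+FFFD is my assumption, not something decided here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decodes the code point starting at s[pos] and advances pos past it.
// Precondition: pos < s.size().
// Assumption: unpaired surrogates are reported as U+FFFD.
inline char32_t nextCodePoint(std::u16string_view s, std::size_t &pos)
{
    const char16_t hi = s[pos++];
    if (hi >= 0xD800 && hi <= 0xDBFF && pos < s.size()) {
        const char16_t lo = s[pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++pos;  // consume the low surrogate as well
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10)
                           + (char32_t(lo) - 0xDC00);
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF)
        return 0xFFFD;  // lone surrogate
    return hi;
}

// Collects all code points; a real iterator would yield them lazily.
inline std::u32string codePoints(std::u16string_view s)
{
    std::u32string out;
    for (std::size_t pos = 0; pos < s.size();)
        out.push_back(nextCodePoint(s, pos));
    return out;
}

inline void codePointExample()
{
    // 'a' followed by U+1F600, which is a surrogate pair in UTF-16.
    const std::u32string cps = codePoints(u"a\U0001F600");
    assert(cps.size() == 2);
    assert(cps[0] == U'a');
    assert(cps[1] == char32_t(0x1F600));
}
```

Callers written against such an interface never see code units, so the same
loop would keep working if the storage underneath switched to UTF-8.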

Then in Qt6 one can simply change the representation without breaking 
compatibility with non-deprecated functions.

-- 
Olivier

Woboq - Qt services and support - https://woboq.com - https://code.woboq.org