[Development] Qt6: Adding UTF-8 storage support to QString

Thu Jan 24 14:06:58 CET 2019

24.01.2019, 10:34, "Olivier Goffart" <olivier at woboq.com>:
> On 23.01.19 23:15, André Pönitz wrote:
>>  On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>>>  23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
>>>>  All of this discussion ignores a major elephant: QString's indexing is
>>>>  by 16-bit UTF-16 code units, not by Unicode characters. We've had Unicode
>>>>  for a couple of decades now.
>>>>
>>>>  We *should* have a string type (I don't care what you call it) that acts
>>>>  on strings indexed by Unicode characters, not in terms of a
>>>>  representation. Whether that string type internally uses UTF-16 or
>>>>  UTF-8 should be invisible to its user. Ideally it would be capable of
>>>>  carrying its data internally in either form (so as to avoid needless
>>>>  conversion when both producer and consumer use the same form) and of
>>>>  converting between the two (e.g. so as to append efficiently) as needed.
>>>
>>>  I think this is excessive. The most common operations on strings in
>>>  application code are:
>>>
>>>  * Pass the string around or compare as an opaque token
>>>  * Draw the string on screen, e.g. with QPainter (while technically it
>>>     falls in the previous category, I think it's important enough to
>>>     deserve a separate item)
>>>  * Find substring or pattern (regex) inside the string
>>>  * Split the string by character, pattern, or index boundaries found by means
>>>     of the previous item
>>>
>>>  I think the only common cases where dealing with Unicode grapheme clusters
>>>  is required are
>>>
>>>  * Handling of text cursor movement
>>>  * Implementation of text shaping, i.e. what Harfbuzz is doing
>>>
>>>  I think having a special iterator would be quite enough for the cursor case.
>>>  Such an iterator could abstract away the underlying encoding, instead of
>>>  forcing everyone to convert to UTF-16 first.
>>
>>  All of that is scarily close to my opinion on the topic.
>
> Same here. I think Konstantin is spot on.
>
> Another example of good string design, I think, is Rust's String. Their
> string is encoded in valid UTF-8, indexed by bytes, and splitting the string in
> the middle of a code point is a programmer error.
>
> As mentioned before, UTF-16 is quite a bad choice; it is only there for
> legacy reasons.
>
> The argument that developers wrongly using indexes cause more problems with
> UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
> means that those developers will see and fix their bugs quickly.
>
> I understand changing QString to UTF-8 is a difficult task if we want to do it
> in a compatible way. However, I think there is a way:
> In Qt5.x:
>   - Introduce some iterator that iterates over Unicode code points.
>   - Deprecate utf16() and other APIs that assume that QString is UTF-16
>   - Replace them with a toUtf16() which returns a QVector<ushort>. I believe that
> it is possible to make the content implicitly shared with the QString, avoiding
> copies. (since it is just a QTypedArrayData internally)

I will be officially pissed off if the possibility to access QString's raw data
without an extra copy is gone :( It would be better if there were a way to query
the internal storage encoding (e.g. isUtf16()) and access the raw data directly.
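
To make the idea concrete, here is a rough sketch in plain C++ (standard
library only, no Qt). Every name in it (Encoding, EncodedView, utf16Data(),
utf8Data(), CodePointCursor, codePoints()) is hypothetical and not existing
Qt API; only isUtf16() is taken from the suggestion above. It assumes
well-formed input and does no validation. The view exposes its raw code units
without copying, and the cursor yields code points regardless of how the data
is stored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; none of this is existing Qt API.
enum class Encoding { Utf8, Utf16 };

// Non-owning view over string storage that remembers its encoding.
class EncodedView {
public:
    explicit EncodedView(const std::string &s)
        : m_data(s.data()), m_units(s.size()), m_enc(Encoding::Utf8) {}
    explicit EncodedView(const std::u16string &s)
        : m_data(s.data()), m_units(s.size()), m_enc(Encoding::Utf16) {}

    bool isUtf16() const { return m_enc == Encoding::Utf16; }
    bool isUtf8() const { return m_enc == Encoding::Utf8; }
    std::size_t unitCount() const { return m_units; }

    // Zero-copy raw access; the caller must check the encoding first.
    const char16_t *utf16Data() const {
        assert(isUtf16());
        return static_cast<const char16_t *>(m_data);
    }
    const char *utf8Data() const {
        assert(isUtf8());
        return static_cast<const char *>(m_data);
    }

private:
    const void *m_data;
    std::size_t m_units;
    Encoding m_enc;
};

// Yields code points from either encoding. Assumes well-formed input.
class CodePointCursor {
public:
    explicit CodePointCursor(const EncodedView &v) : m_view(v) {}

    bool next(char32_t &cp) {
        if (m_pos >= m_view.unitCount())
            return false;
        cp = m_view.isUtf16() ? decodeUtf16() : decodeUtf8();
        return true;
    }

private:
    char32_t decodeUtf16() {
        const char16_t *p = m_view.utf16Data();
        const char32_t hi = p[m_pos++];
        // A high surrogate combines with the following low surrogate.
        if (hi >= 0xD800 && hi <= 0xDBFF && m_pos < m_view.unitCount()) {
            const char32_t lo = p[m_pos++];
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
        return hi;
    }

    char32_t decodeUtf8() {
        const char *p = m_view.utf8Data();
        const auto b0 = static_cast<unsigned char>(p[m_pos++]);
        if (b0 < 0x80)
            return b0;
        // The lead byte tells how many continuation bytes follow.
        int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
        char32_t cp = b0 & (0x3F >> extra);
        for (; extra > 0 && m_pos < m_view.unitCount(); --extra)
            cp = (cp << 6) | (static_cast<unsigned char>(p[m_pos++]) & 0x3F);
        return cp;
    }

    EncodedView m_view;
    std::size_t m_pos = 0;
};

// Convenience helper: collect all code points of a view.
std::vector<char32_t> codePoints(const EncodedView &v) {
    std::vector<char32_t> out;
    CodePointCursor cursor(v);
    for (char32_t cp; cursor.next(cp);)
        out.push_back(cp);
    return out;
}
```

Code that only needs code points goes through the cursor and never learns the
encoding; code that wants the bytes checks isUtf16() and takes the pointer
with no conversion and no copy.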

>
> Then in Qt6 one can simply change the representation without breaking
> compatibility with non-deprecated functions.
>
> --
> Olivier
>
> Woboq - Qt services and support - https://woboq.com - https://code.woboq.org
>
> _______________________________________________
> Development mailing list
> Development at qt-project.org
> https://lists.qt-project.org/listinfo/development

-- 
Regards,
Konstantin