[Development] Qt6: Adding UTF-8 storage support to QString
annulen at yandex.ru
Thu Jan 24 14:06:58 CET 2019
24.01.2019, 10:34, "Olivier Goffart" <olivier at woboq.com>:
> On 23.01.19 23:15, André Pönitz wrote:
>> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>>> 23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
>>>> All of this discussion ignores a major elephant: QString's indexing is
>>>> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
>>>> for a couple of decades now.
>>>> We *should* have a string type (I don't care what you call it) that acts
>>>> on strings indexed by Unicode characters, not in terms of a
>>>> representation. Whether that string type internally uses UTF-16 or
>>>> UTF-8 should be invisible to its user. Ideally it would be capable of
>>>> carrying its data internally in either form (so as to avoid needless
>>>> conversion when both producer and consumer use the same form) and of
>>>> converting between the two (e.g. so as to append efficiently) as needed.
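To make the distinction above concrete, here is a small sketch in standard C++. std::u16string stands in for QString, and codePointCount is a hypothetical helper, not Qt API. It assumes well-formed UTF-16 and shows that the number of 16-bit units and the number of Unicode code points diverge as soon as a character outside the BMP appears.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// Assumes well-formed input, so every low surrogate follows a high surrogate.
std::size_t codePointCount(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t u : s) {
        // Low (trailing) surrogates 0xDC00..0xDFFF belong to the previous unit.
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    }
    return n;
}
```

For example, "a" followed by U+1F600 occupies three UTF-16 code units but is only two code points, so index 2 points into the middle of a character.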
>>> I think this is excessive. Most common operations with strings in application
>>> code are:
>>> * Pass the string around or compare as an opaque token
>>> * Draw the string on screen e.g. with QPainter (while technically it
>>> falls in the previous category, I think it's important enough to
>>> deserve a separate item)
>>> * Find substring or pattern (regex) inside the string
>>> * Split the string by character, pattern, or index boundaries found by means
>>> of previous item
>>> I think the only common cases where dealing with Unicode grapheme clusters
>>> is required are
>>> * Handling of text cursor movement
>>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>> I think having a special iterator would be quite enough for the cursor case. Such
>>> an iterator could abstract away the underlying encoding, instead of forcing everyone
>>> to convert to UTF-16 first.
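A rough sketch of what an encoding-agnostic traversal could look like, in standard C++ rather than Qt. The names nextCodePoint and codePoints are hypothetical. The sketch assumes well-formed input and steps over code points only; real cursor movement needs grapheme cluster segmentation on top of this, which requires Unicode property data and is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes one code point from well-formed UTF-8 starting at s[i]; advances i.
char32_t nextCodePoint(const std::string &s, std::size_t &i)
{
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80)
        return b;                                  // single-byte (ASCII)
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    char32_t cp = b & (0x3F >> extra);             // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Decodes one code point from well-formed UTF-16 starting at s[i]; advances i.
char32_t nextCodePoint(const std::u16string &s, std::size_t &i)
{
    char32_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        char32_t lo = s[i++];                      // trailing surrogate
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
    return u;
}

// The caller sees the same sequence of code points for either storage encoding.
template <typename String>
std::vector<char32_t> codePoints(const String &s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(nextCodePoint(s, i));
    return out;
}
```

Code that only calls codePoints() never needs to know which encoding the string was stored in, which is the property asked for above.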
>> All of that is scarily close to my opinion on the topic.
> Same here. I think Konstantin is spot on.
> Another example of good string design, I think, is Rust's String. Their
> string is encoded in valid UTF-8, indexed by bytes, and splitting the string in
> the middle of a code point is a programmer error.
> As already mentioned before, UTF-16 is quite a bad choice, if it weren't for
> The argument that developers wrongly using indexes causes more problems with
> UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
> means that the developer will see and fix their bugs quickly.
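The byte-indexed model described above can be sketched in standard C++ as follows. isCharBoundary and splitAt are hypothetical names, loosely modelled on Rust's is_char_boundary and split_at; the sketch assumes the input is already valid UTF-8 and uses assert to stand in for Rust's panic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// True if byte offset i is the start of a code point (or the end of the string).
bool isCharBoundary(const std::string &s, std::size_t i)
{
    if (i == s.size())
        return true;
    if (i > s.size())
        return false;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
}

// Splits at byte offset i. Splitting inside a code point is treated as a
// programmer error and trips the assertion in debug builds.
std::pair<std::string, std::string> splitAt(const std::string &s, std::size_t i)
{
    assert(isCharBoundary(s, i));
    return {s.substr(0, i), s.substr(i)};
}
```

With UTF-8, a bad index is hit by any non-ASCII character, so this kind of mistake tends to surface in ordinary testing rather than only with rare non-BMP input.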
> I understand changing QString to UTF-8 is a difficult task if we want to do it
> in a compatible way. However, I think there is a way:
> In Qt5.x:
> - Introduce some iterator that iterates over unicode code points.
> - Deprecate utf16() and other APIs that assume that QString is UTF-16
> - Replace them by a toUtf16 which returns a QVector<ushort>. I believe that
> it is possible to make the content implicitly shared with the QString, avoiding
> copies. (since it is just a QTypedArrayData internally)
I will be officially pissed off if the possibility to access the raw data of QString without
an extra copy is gone :( It would be better if there were a way to figure out the internal
storage encoding (e.g. isUtf16()) and access the raw data.
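Roughly what I have in mind, as a sketch in standard C++. DualString and all of its members are hypothetical and only illustrate the idea; nothing here is existing or proposed Qt API, and implicit sharing is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical string that keeps its data in whichever encoding it was given.
class DualString
{
public:
    explicit DualString(std::string utf8)
        : m_utf8(std::move(utf8)), m_isUtf16(false) {}
    explicit DualString(std::u16string utf16)
        : m_utf16(std::move(utf16)), m_isUtf16(true) {}

    // Lets the caller find out the internal storage encoding.
    bool isUtf16() const { return m_isUtf16; }

    // Raw access without copying; nullptr if the data is stored in the other encoding.
    const char *utf8Data() const { return m_isUtf16 ? nullptr : m_utf8.data(); }
    const char16_t *utf16Data() const { return m_isUtf16 ? m_utf16.data() : nullptr; }

    // Length in code units of the current storage encoding.
    std::size_t codeUnitCount() const
    {
        return m_isUtf16 ? m_utf16.size() : m_utf8.size();
    }

private:
    std::string m_utf8;
    std::u16string m_utf16;
    bool m_isUtf16;
};
```

A caller that can work with both encodings checks isUtf16() and takes the matching pointer; only callers that insist on one encoding would need a converting accessor that may copy.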
> Then in Qt6 one can simply change the representation without breaking
> compatibility with non-deprecated functions.
> Woboq - Qt services and support - https://woboq.com - https://code.woboq.org