[Development] Qt6: Adding UTF-8 storage support to QString

Konstantin Tokarev annulen at yandex.ru
Wed Jan 23 15:40:33 CET 2019



23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
> All of this discussion ignores a major elephant: QString's indexing is
> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
> for a couple of decades now.
>
> We *should* have a string type (I don't care what you call it) that acts
> on strings indexed by Unicode characters, not in terms of a
> representation. Whether that string type internally uses UTF-16 or
> UTF-8 should be invisible to its user. Ideally it would be capable of
> carrying its data internally in either form (so as to avoid needless
> conversion when both producer and consumer use the same form) and of
> converting between the two (e.g. so as to append efficiently) as needed.

I think this is excessive. The most common operations on strings in application
code are:
* Passing the string around or comparing it as an opaque token
* Drawing the string on screen, e.g. with QPainter (technically this falls into
the previous category, but I think it is important enough to deserve a separate
item)
* Finding a substring or pattern (regex) inside the string
* Splitting the string by a character, a pattern, or index boundaries found by
means of the previous item
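
To illustrate the last two items: a substring search or split does not need
character indexing at all. The sketch below is plain standard C++, not Qt API,
and the name splitUtf8 is made up for this example. It relies on the fact that
UTF-8 is self-synchronizing, so a byte-wise match of a well-formed needle in a
well-formed haystack always lands on a code point boundary. It assumes both
inputs are valid UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on a UTF-8 separator, working purely
// on bytes. No decoding to code points is needed, because a valid UTF-8
// sequence cannot match in the middle of another code point's encoding.
std::vector<std::string> splitUtf8(const std::string &text, const std::string &sep)
{
    std::vector<std::string> parts;
    if (sep.empty()) {            // nothing to split on: return the input whole
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    for (;;) {
        const std::size_t hit = text.find(sep, start);  // byte-wise search
        if (hit == std::string::npos)
            break;
        parts.push_back(text.substr(start, hit - start));
        start = hit + sep.size();                       // skip the separator bytes
    }
    parts.push_back(text.substr(start));                // trailing piece
    return parts;
}
```

The indices here are byte offsets, which is all the caller needs in order to
cut the string at the boundaries the search itself reported.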

I think the only common cases where dealing with Unicode grapheme clusters
is required are:
* Handling of text cursor movement
* Implementation of text shaping, i.e. what HarfBuzz does

I think a special iterator would be quite enough for the cursor case. Such
an iterator could abstract away the underlying encoding, instead of forcing
everyone to convert to UTF-16 first.
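
A minimal sketch of what I mean, in plain standard C++ rather than Qt API. The
names nextCodePoint and codePoints are hypothetical. The same traversal code
runs over UTF-8 and UTF-16 storage, and the caller only ever sees code points.
Note the simplifications: it assumes well-formed input, and it stops at code
points. Real cursor movement would need grapheme cluster segmentation (UAX #29)
layered on top, which is omitted here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode one code point from UTF-8 storage starting at pos and advance pos.
// Assumes well-formed input; only guards against running off the end.
char32_t nextCodePoint(const std::string &s, std::size_t &pos)
{
    const unsigned char b0 = static_cast<unsigned char>(s[pos++]);
    if (b0 < 0x80)
        return b0;                                   // single-byte (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);              // payload bits of the lead byte
    while (extra-- > 0 && pos < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    return cp;
}

// Decode one code point from UTF-16 storage starting at pos and advance pos.
// An unpaired surrogate is returned as-is.
char32_t nextCodePoint(const std::u16string &s, std::size_t &pos)
{
    const char16_t u0 = s[pos++];
    if (u0 >= 0xD800 && u0 <= 0xDBFF && pos < s.size()) {
        const char16_t u1 = s[pos];
        if (u1 >= 0xDC00 && u1 <= 0xDFFF) {          // valid surrogate pair
            ++pos;
            return 0x10000 + ((static_cast<char32_t>(u0) - 0xD800) << 10)
                           + (static_cast<char32_t>(u1) - 0xDC00);
        }
    }
    return u0;
}

// Encoding-agnostic traversal: overload resolution picks the right decoder,
// so the caller never deals with code units.
template <typename String>
std::vector<char32_t> codePoints(const String &s)
{
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < s.size();)
        out.push_back(nextCodePoint(s, pos));
    return out;
}
```

With this, codePoints() gives the same sequence for a std::string holding UTF-8
and a std::u16string holding UTF-16 of the same text, without converting one
storage format into the other.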

>
> Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are
> types we do need in diverse places - but they should be described
> differently from the string type (call it a "text" type, if hysterical
> reasons oblige us to use "string" for its encoding). They can be
> interpreted as strings, hence can serve as backing-store for a string,
> provided they respect the relevant rules of a relevant encoding.
>
> If blob[index] always returns a Unicode *character*, then blob is a
> string; if it can sometimes return one half of a UTF-16 surrogate pair
> (as is the case with QString today) or one byte of a multi-byte UTF-8
> chunk, then blob is not really a string, it's just the storage for an
> encoding of a string.
>
> What are our chances of getting this right in Qt 6 ?
> It's the 21st century - way past time we did this,
>
>         Eddy.
> _______________________________________________
> Development mailing list
> Development at qt-project.org
> https://lists.qt-project.org/listinfo/development

-- 
Regards,
Konstantin



