[Development] Qt6: Adding UTF-8 storage support to QString

Konstantin Ritt ritt.ks at gmail.com
Thu Jan 24 12:06:30 CET 2019


>  - Introduce some iterator that iterates over unicode code points.

QStringIterator
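
For readers who have not looked at it, the idea behind such an iterator can be sketched roughly as follows. This is only an illustration on top of std::u16string, not the actual Qt class: the names CodePointIterator and codePoints() are made up here, and reporting an unpaired surrogate as U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not Qt API): walks a UTF-16 buffer one code point at a time.
class CodePointIterator {
public:
    explicit CodePointIterator(const std::u16string &s) : str(s), pos(0) {}

    bool hasNext() const { return pos < str.size(); }

    // Returns the next code point. Assumption of this sketch: an unpaired
    // surrogate is reported as U+FFFD (REPLACEMENT CHARACTER).
    char32_t next() {
        const char16_t hi = str[pos++];
        if (hi >= 0xD800 && hi <= 0xDBFF) {              // high surrogate
            if (pos < str.size()) {
                const char16_t lo = str[pos];
                if (lo >= 0xDC00 && lo <= 0xDFFF) {      // followed by a low surrogate
                    ++pos;
                    return 0x10000
                         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                         + (static_cast<char32_t>(lo) - 0xDC00);
                }
            }
            return 0xFFFD;                               // lone high surrogate
        }
        if (hi >= 0xDC00 && hi <= 0xDFFF)
            return 0xFFFD;                               // lone low surrogate
        return hi;                                       // plain BMP code point
    }

private:
    const std::u16string &str;
    std::size_t pos;
};

// Collects all code points of a UTF-16 string.
std::vector<char32_t> codePoints(const std::u16string &s) {
    std::vector<char32_t> result;
    CodePointIterator it(s);
    while (it.hasNext())
        result.push_back(it.next());
    return result;
}
```

The caller never sees surrogates or code unit indexes, so the same interface could sit on top of UTF-8 storage just as well.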


> We *should* have a string type (I don't care what you call it) that acts
> on strings indexed by Unicode characters, not in terms of a
> representation.  Whether that string type internally uses UTF-16 or
> UTF-8 should be invisible to its user.  Ideally it would be capable of
> carrying its data internally in either form (so as to avoid needless
> conversion when both producer and consumer use the same form) and of
> converting between the two (e.g. so as to append efficiently) as needed.

That's what I'd support with both hands.
However, I don't think we could do that on QString without breaking most of
the existing code.

P.S. \note Unicode operates on "code points", not "characters". Moreover,
there is no such thing as a "glyph" in a Unicode string, and looking for
grapheme or glyph boundaries is clearly not the responsibility of a string
storage or a string view.
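
To make the distinction concrete, here is a small sketch that counts code points in UTF-8 and UTF-16 data. The function names are made up for illustration, and both functions assume well-formed input (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t countCodePointsUtf8(const std::string &utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Hypothetical helper: counts code points in well-formed UTF-16 by counting
// every code unit that is not a low (trailing) surrogate.
std::size_t countCodePointsUtf16(const std::u16string &utf16) {
    std::size_t n = 0;
    for (char16_t u : utf16) {
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    }
    return n;
}
```

For "e" followed by U+0301 (COMBINING ACUTE ACCENT), the UTF-8 form "e\xCC\x81" is 3 bytes, the UTF-16 form is 2 code units, and both functions report 2 code points, although a user perceives a single character. None of these counts says anything about grapheme or glyph boundaries.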

Regards,
Konstantin


On Thu, 24 Jan 2019 at 10:33, Olivier Goffart <olivier at woboq.com> wrote:

> On 23.01.19 23:15, André Pönitz wrote:
> > On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
> >> 23.01.2019, 16:55, "Edward Welbourne" <edward.welbourne at qt.io>:
> >>> All of this discussion ignores a major elephant: QString's indexing is
> >>> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
> >>> for a couple of decades now.
> >>>
> >>> We *should* have a string type (I don't care what you call it) that acts
> >>> on strings indexed by Unicode characters, not in terms of a
> >>> representation. Whether that string type internally uses UTF-16 or
> >>> UTF-8 should be invisible to its user. Ideally it would be capable of
> >>> carrying its data internally in either form (so as to avoid needless
> >>> conversion when both producer and consumer use the same form) and of
> >>> converting between the two (e.g. so as to append efficiently) as needed.
> >>
> >> I think this is excessive. The most common operations with strings in
> >> application code are:
> >>
> >> * Pass the string around or compare as an opaque token
> >> * Draw the string on screen e.g. with QPainter (while technically it
> >>    falls in the previous category, I think it's important enough to
> >>    deserve separate item)
> >> * Find substring or pattern (regex) inside the string
> >> * Split the string by character, pattern, or index boundaries found by
> >>    means of the previous item
> >>
> >> I think the only common cases where dealing with Unicode grapheme
> >> clusters is required are:
> >>
> >> * Handling of text cursor movement
> >> * Implementation of text shaping, i.e. what Harfbuzz is doing
> >>
> >> I think having a special iterator would be quite enough for the cursor
> >> case. Such an iterator could abstract away the underlying encoding,
> >> instead of forcing everyone to convert to UTF-16 first.
> >
> > All of that is scarily close to my opinion on the topic.
>
> Same here. I think Konstantin is spot on.
>
> Another example of good string design, I think, is Rust's String. It is
> encoded in valid UTF-8, indexed by bytes, and splitting the string in the
> middle of a code point is a programmer error.
>
> As already mentioned, UTF-16 would be quite a bad choice if it weren't for
> legacy.
>
> The argument that developers wrongly using indexes would cause more
> problems with UTF-8 than with UTF-16 ("it would happen for a lot more
> characters") actually means that developers will see and fix their bugs
> quickly.
>
> I understand changing QString to UTF-8 is a difficult task if we want to
> do it in a compatible way. However, I think there is a way:
> In Qt 5.x:
>   - Introduce some iterator that iterates over Unicode code points.
>   - Deprecate utf16() and other APIs that assume that QString is UTF-16.
>   - Replace them with a toUtf16() which returns a QVector<ushort>. I
>     believe that it is possible to make the content implicitly shared with
>     the QString, avoiding copies (since it is just a QTypedArrayData
>     internally).
>
> Then in Qt6 one can simply change the representation without breaking
> compatibility with non-deprecated functions.
>
> --
> Olivier
>
> Woboq - Qt services and support - https://woboq.com -
> https://code.woboq.org