[Development] Qt6: Adding UTF-8 storage support to QString

Wed Jan 23 15:07:37 CET 2019

I am not sure it would be a good idea, because a glyph can still be composed of more than one code point, and how depends on the language. Sometimes you want characters, sometimes code points, and sometimes glyphs. Would it not be better to use a simple container, with functions on top that operate on a view, so that we could use them with any container? That would avoid the allocations needed to transform characters from one container to another. In any case, I think there are so many uses for strings that one class is not enough to tackle all these problems.
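
To make the "functions on top of a view" idea concrete, here is a minimal sketch in standard C++17 rather than Qt. The names utf8CodePointCount and countIn are made up for illustration, and the input is assumed to be well-formed UTF-8. It counts code points, not glyphs, which would need grapheme segmentation on top.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical free function: counts code points in UTF-8 text by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; it does no validation.
std::size_t utf8CodePointCount(std::string_view bytes)
{
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Any contiguous char container can be viewed without copying its data.
std::size_t countIn(const std::vector<char> &buf)
{
    return utf8CodePointCount(std::string_view(buf.data(), buf.size()));
}
```

The same function accepts a std::string, a string literal or a sub-range of a larger buffer, and none of those calls allocates.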

________________________________
From: Development <development-bounces at qt-project.org> on behalf of Edward Welbourne <edward.welbourne at qt.io>
Sent: Wednesday, January 23, 2019 2:53:00 PM
To: Arnaud Clère; Thiago Macieira
Cc: development at qt-project.org
Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString

All of this discussion ignores a major elephant: QString's indexing is
by 16-bit UTF-16 tokens, not by Unicode characters.  We've had Unicode
for a couple of decades now.

We *should* have a string type (I don't care what you call it) that acts
on strings indexed by Unicode characters, not in terms of a
representation.  Whether that string type internally uses UTF-16 or
UTF-8 should be invisible to its user.  Ideally it would be capable of
carrying its data internally in either form (so as to avoid needless
conversion when both producer and consumer use the same form) and of
converting between the two (e.g. so as to append efficiently) as needed.
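
A rough sketch of such a type, in standard C++17 and not an existing Qt class, might look like the following. The class name Text and its members are hypothetical, the data is assumed to be well-formed, and conversion between the two forms (for appending) is left out to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

// Hypothetical text type that keeps whichever encoding it was given.
class Text {
public:
    explicit Text(std::string utf8) : d(std::move(utf8)) {}
    explicit Text(std::u16string utf16) : d(std::move(utf16)) {}

    // Exposed only for the demonstration; a real API would hide this.
    bool isUtf8() const { return std::holds_alternative<std::string>(d); }

    // Number of code points, independent of the internal storage.
    // Assumes well-formed UTF-8 or UTF-16.
    std::size_t size() const
    {
        std::size_t n = 0;
        if (const auto *u8 = std::get_if<std::string>(&d)) {
            for (unsigned char b : *u8)
                n += (b & 0xC0) != 0x80;             // skip continuation bytes
        } else {
            for (char16_t u : std::get<std::u16string>(d))
                n += !(u >= 0xDC00 && u <= 0xDFFF);  // skip low surrogates
        }
        return n;
    }

private:
    std::variant<std::string, std::u16string> d;
};
```

A Text built from the UTF-8 bytes of "hé" and one built from its UTF-16 form both report a size of 2, without either being converted.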

Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are
types we do need in diverse places - but they should be described
differently from the string type (call it a "text" type, if hysterical
reasons oblige us to use "string" for its encoding).  They can be
interpreted as strings, hence can serve as backing-store for a string,
provided they respect the relevant rules of a relevant encoding.

If blob[index] always returns a Unicode *character*, then blob is a
string; if it can sometimes return one half of a UTF-16 surrogate pair
(as is the case with QString today) or one byte of a multi-byte UTF-8
chunk, then blob is not really a string, it's just the storage for an
encoding of a string.
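
The difference can be shown with a small sketch in standard C++, using std::u16string as a stand-in for UTF-16 storage. The helper names are hypothetical and this is not QString's API. Indexing the storage yields code units, while indexing by code point needs a scan that treats a surrogate pair as one unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if a well-formed surrogate pair starts at code-unit index i.
inline bool isPairAt(const std::u16string &s, std::size_t i)
{
    return isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]);
}

// Number of code points; an unpaired surrogate counts as one.
std::size_t codePointCount(const std::u16string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n)
        i += isPairAt(s, i) ? 2 : 1;
    return n;
}

// The n-th code point (0-based), found by a linear scan from the start.
char32_t codePointAt(const std::u16string &s, std::size_t n)
{
    assert(n < codePointCount(s));
    std::size_t i = 0;
    for (; n > 0; --n)
        i += isPairAt(s, i) ? 2 : 1;
    if (isPairAt(s, i))
        return 0x10000 + ((char32_t(s[i]) - 0xD800) << 10)
                       + (char32_t(s[i + 1]) - 0xDC00);
    return s[i];
}
```

For u"a\U0001F600b" the storage holds four code units and s[1] is only the high surrogate 0xD83D, whereas codePointAt(s, 1) returns the whole character U+1F600. The linear scan is the price of this naive approach. A real text type would need an index or iterator-based access to avoid it.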

What are our chances of getting this right in Qt 6 ?
It's the 21st century - way past time we did this,

        Eddy.