[Development] Qt6: Adding UTF-8 storage support to QString

Edward Welbourne edward.welbourne at qt.io
Wed Jan 23 14:53:00 CET 2019


All of this discussion ignores a major elephant: QString's indexing is
by 16-bit UTF-16 code units, not by Unicode characters.  We've had
Unicode for a couple of decades now.

We *should* have a string type (I don't care what you call it) that acts
on strings indexed by Unicode characters, not in terms of a
representation.  Whether that string type internally uses UTF-16 or
UTF-8 should be invisible to its user.  Ideally it would be capable of
carrying its data internally in either form (so as to avoid needless
conversion when both producer and consumer use the same form) and of
converting between the two (e.g. so as to append efficiently) as needed.
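
To make the idea concrete, here is a minimal sketch in standard C++17
(no Qt) of such a type.  Everything in it is hypothetical: the class
name Text, its fromUtf8()/fromUtf16() factories and at() are invented
for illustration and are not existing or proposed Qt API.  It treats
"Unicode character" as "code point" (it does not handle grapheme
clusters), assumes well-formed input, and indexes by a linear scan, so
it shows the interface rather than an efficient implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

// Hypothetical "text" type: indexed by code point, while the storage is
// either UTF-8 or UTF-16.  Assumes well-formed input; no validation.
class Text {
    using Storage = std::variant<std::string, std::u16string>;

public:
    static Text fromUtf8(std::string bytes) { return Text(Storage(std::move(bytes))); }
    static Text fromUtf16(std::u16string units) { return Text(Storage(std::move(units))); }

    // Number of code points, whatever the storage form.
    std::size_t size() const {
        std::size_t n = 0;
        if (const auto *u8 = std::get_if<std::string>(&data_)) {
            for (unsigned char c : *u8)
                if ((c & 0xC0) != 0x80) ++n;      // skip continuation bytes
        } else {
            for (char16_t u : std::get<std::u16string>(data_))
                if (u < 0xDC00 || u > 0xDFFF) ++n; // skip low surrogates
        }
        return n;
    }

    // Code point at the given index; U+FFFD if index is out of range.
    char32_t at(std::size_t index) const {
        if (const auto *u8 = std::get_if<std::string>(&data_))
            return atUtf8(*u8, index);
        return atUtf16(std::get<std::u16string>(data_), index);
    }

private:
    explicit Text(Storage d) : data_(std::move(d)) {}

    static char32_t atUtf8(const std::string &s, std::size_t index) {
        for (std::size_t i = 0; i < s.size(); --index) {
            const unsigned char lead = static_cast<unsigned char>(s[i]);
            const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            assert(i + len <= s.size());           // well-formed input assumed
            if (index == 0) {
                if (len == 1) return lead;
                char32_t cp = lead & (0xFF >> (len + 1)); // payload bits of lead byte
                for (std::size_t k = 1; k < len; ++k)
                    cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
                return cp;
            }
            i += len;
        }
        return 0xFFFD;
    }

    static char32_t atUtf16(const std::u16string &s, std::size_t index) {
        for (std::size_t i = 0; i < s.size(); --index) {
            const bool pair = s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < s.size();
            if (index == 0) {
                if (!pair) return s[i];
                return 0x10000 + ((char32_t(s[i]) - 0xD800) << 10)
                               + (char32_t(s[i + 1]) - 0xDC00);
            }
            i += pair ? 2 : 1;
        }
        return 0xFFFD;
    }

    Storage data_;
};
```

With this, Text::fromUtf8("a\xF0\x9F\x98\x80" "b") and
Text::fromUtf16(u"a\U0001F600b") both report size() == 3 and return
U+1F600 from at(1); the caller never sees which encoding is underneath.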

Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are
types we do need in diverse places - but they should be described
differently from the string type (call it a "text" type, if hysterical
reasons oblige us to use "string" for its encoding).  They can be
interpreted as strings, hence can serve as backing-store for a string,
provided they respect the relevant rules of a relevant encoding.

If blob[index] always returns a Unicode *character*, then blob is a
string; if it can sometimes return one half of a UTF-16 surrogate pair
(as is the case with QString today) or one byte of a multi-byte UTF-8
chunk, then blob is not really a string, it's just the storage for an
encoding of a string.
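
The distinction can be illustrated with a short sketch in standard C++,
using std::u16string as a stand-in for today's QString storage (an
assumption made so the example needs no Qt).  The helper names
codePointCount() and codePointAt() are hypothetical, the input is
assumed to be well-formed UTF-16, and "character" here means code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Surrogate classification for UTF-16 code units.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Number of code points in well-formed UTF-16: every unit that is not
// the second half of a surrogate pair starts a new code point.
inline std::size_t codePointCount(const std::u16string &s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (!isLowSurrogate(u)) ++n;
    return n;
}

// Code point at a character index (not a code-unit index).  Linear scan;
// returns U+FFFD if the index is out of range.
inline char32_t codePointAt(const std::u16string &s, std::size_t index) {
    std::size_t i = 0;
    while (i < s.size()) {
        const bool pair = isHighSurrogate(s[i]) && i + 1 < s.size()
                          && isLowSurrogate(s[i + 1]);
        if (index == 0) {
            if (!pair) return s[i];
            // Combine the two halves into one code point above U+FFFF.
            return 0x10000 + ((char32_t(s[i]) - 0xD800) << 10)
                           + (char32_t(s[i + 1]) - 0xDC00);
        }
        --index;
        i += pair ? 2 : 1;
    }
    return 0xFFFD;
}
```

For s = u"a\U0001F600b", s.size() is 4 and s[1] is 0xD83D, half of a
surrogate pair: that is the storage view.  codePointCount(s) is 3 and
codePointAt(s, 1) is U+1F600: that is the string view.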

What are our chances of getting this right in Qt 6?
It's the 21st century - way past time we did this,

	Eddy.


