[Development] Qt6: Adding UTF-8 storage support to QString
Thiago Macieira
thiago.macieira at intel.com
Wed Jan 16 22:29:00 CET 2019
On Wednesday, 16 January 2019 13:16:39 PST Konstantin Tokarev wrote:
> 1. Code points may be encoded as surrogate pairs in UTF-16, e.g. this is the
> case for Emoji characters. QString ignores this fact, indexing 16-bit
> QChars. To make things worse, several QString methods like left(), right(),
> and mid() will happily cut a surrogate pair in half.
So does QByteArray, and so would a UTF-8-based QString, except it would happen
for a lot more characters.
What you want is QTextBoundaryFinder and possibly QFontMetrics.
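To illustrate the cutting problem, here is a minimal sketch in standard C++. It
uses std::u16string instead of QString so that it stands alone; naiveLeft() and
safeLeft() are hypothetical helpers, not Qt API. safeLeft() only avoids
splitting a surrogate pair. It knows nothing about grapheme clusters, which is
what QTextBoundaryFinder is for.

```cpp
#include <cstddef>
#include <string>

// UTF-16 high (leading) surrogates are U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
// UTF-16 low (trailing) surrogates are U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical stand-in for a left(n) that counts 16-bit code units only.
std::u16string naiveLeft(const std::u16string &s, std::size_t n)
{
    return s.substr(0, n);
}

// Hypothetical helper: same as naiveLeft(), but if the cut would land between
// the two halves of a surrogate pair, it backs off by one code unit.
std::u16string safeLeft(const std::u16string &s, std::size_t n)
{
    if (n >= s.size())
        return s;
    if (n > 0 && isHighSurrogate(s[n - 1]) && isLowSurrogate(s[n]))
        --n;
    return s.substr(0, n);
}
```

With s = u"a\U0001F600b" (four code units: 'a', a surrogate pair, 'b'),
naiveLeft(s, 2) ends in a lone high surrogate, while safeLeft(s, 2) returns
just u"a".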
> 2. When people are talking about character indexing they often imply
> indexing of grapheme clusters. In the Unicode world, a grapheme cluster may be
> represented as several code points, depending on the normalization form of the
> source. To make things worse, even in NFC form not every grapheme cluster
> that is possible in Unicode is representable as a single code point.
Indeed, and SG16 in the C++ standards committee is looking into grapheme
clusters as the basic unit. Unfortunately, their work does not coincide with our Qt 6
timelines, nor would we be able to adapt that quickly based on how much code
there is using QString.
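The difference between code units, code points and what a user perceives as
one character can be sketched as follows. This is plain C++ with
std::u16string, not Qt code, and countCodePoints() is a hypothetical helper. It
counts code points only. Finding grapheme cluster boundaries needs the Unicode
segmentation rules and is not attempted here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string. A valid
// surrogate pair counts as one; an unpaired surrogate also counts as one.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i; // skip the low half of the pair
        ++count;
    }
    return count;
}

// The same user-perceived character in two normalization forms.
const std::u16string composed = u"\u00E9";     // NFC: U+00E9
const std::u16string decomposed = u"e\u0301";  // NFD: 'e' + U+0301 combining acute
```

Both strings display as a single "é", yet countCodePoints() gives 1 for
composed and 2 for decomposed, so even code point indexing does not match what
a user would call a character.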
We should pay attention to the SG16 work and make sure it works with Qt 6,
with eyes towards a better API in Qt 7.
Nowhere did I say that we should use UTF-8.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center