[Development] Qt6: Adding UTF-8 storage support to QString

Wed Jan 16 22:29:00 CET 2019

On Wednesday, 16 January 2019 13:16:39 PST Konstantin Tokarev wrote:
> 1. Code points may be encoded as surrogate pairs in UTF-16, e.g. this is the
> case for Emoji characters. QString ignores this fact, indexing 16-bit
> QChars. To make things worse, several QString methods like left(), right(),
> and mid() will happily cut surrogate pair in a half.

So does QByteArray, and so would a UTF-8 based QString, except it would
happen for a lot more characters.
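
To make that concrete, here is a minimal sketch in plain standard C++ (no Qt)
that counts how many cut positions would land inside an encoded code point.
The helper names splitsUtf16, splitsUtf8 and countBadCuts are made up for this
illustration and are not Qt API; the input is assumed to be well-formed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if cutting a UTF-16 string before index i
// would separate a high surrogate from the low surrogate that follows it.
bool splitsUtf16(const std::u16string &s, std::size_t i)
{
    if (i == 0 || i >= s.size())
        return false;
    const char16_t prev = s[i - 1];
    const char16_t next = s[i];
    return prev >= 0xD800 && prev <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF;
}

// Hypothetical helper: true if cutting a UTF-8 string before index i
// would land on a continuation byte (bit pattern 10xxxxxx).
bool splitsUtf8(const std::string &s, std::size_t i)
{
    if (i == 0 || i >= s.size())
        return false;
    return (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80;
}

// Counts the cut positions (0..size) that would split an encoded code point.
template <typename String, typename Pred>
std::size_t countBadCuts(const String &s, Pred splits)
{
    std::size_t bad = 0;
    for (std::size_t i = 0; i <= s.size(); ++i) {
        if (splits(s, i))
            ++bad;
    }
    return bad;
}
```

For the two-character text U+00E9 followed by U+1F600, the UTF-16 form has
three code units and one unsafe cut position, while the UTF-8 form has six
bytes and four unsafe cut positions. Neither check knows anything about
grapheme clusters.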

What you want is QTextBoundaryFinder and possibly QFontMetrics.

> 2. When people are talking about character indexing they often imply
> indexing of grapheme clusters. In Unicode world grapheme cluster may be
> represented as a several code points depending on normalization form of the
> source. To make things worse, even in NFC form not every grapheme cluster
> that is possible in Unicode is representable as a single code point.

Indeed, and SG16, the Unicode study group of the C++ standards committee, is
looking into grapheme clusters as the basic unit. Unfortunately, their work
does not coincide with our Qt 6 timelines, nor would we be able to adapt that
quickly, given how much code there is using QString.

We should pay attention to the SG16 work and make sure it works with Qt 6,
with an eye towards a better API in Qt 7.

Nowhere did I say that we should use UTF-8.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center