[Development] Why can't QString use UTF-8 internally?

Thu Feb 12 10:11:53 CET 2015

On 12 Feb 2015, at 08:55, Konstantin Ritt <ritt.ks at gmail.com> wrote:

> 2015-02-12 11:53 GMT+04:00 Konstantin Ritt <ritt.ks at gmail.com>:
> 2015-02-12 11:39 GMT+04:00 Rutledge Shawn <Shawn.Rutledge at theqtcompany.com>:
> 
> On 11 Feb 2015, at 18:15, Konstantin Ritt <ritt.ks at gmail.com> wrote:
> 
> > FYI: Unicode codepoint != character visual representation. Moreover, a single character could be represented with a sequence of glyphs or vice versa - a sequence of characters could be represented with a single glyph.
> > QString (and every other Unicode string class in the world) represents a sequence of Unicode codepoints (in this or that UTF), not characters or glyphs - always remember that!
> 
> Is it impossible to convert some of the possible multi-codepoint sequences into single ones, or is it just that we prefer to preserve them so that when you convert back to UTF you get the same bytes with which you created the QString?
> 
> Not sure I understand your question in the context of visual representation.
> I assume you're talking about composing the input string (though the same string, composed and decomposed, would be shaped into the same sequence of glyphs).
> A while ago we decided not to change the composition form of the input text, and to let the user (de)compose wherever a fixed composition form is needed, so that QString(wellformed_unicode_text).toUnicode() == wellformed_unicode_text.
> 
> P.S. We could re-consider this or could introduce a macro that would change the composition form of a QString input but…why?

It might be almost within our power to index into a QString and get a single, complete, renderable glyph, which is practical at least for rendering, and maybe for editing too.  But if we achieved that by storing the Unicode in composed form, we’d lose the ability to reproduce the input text exactly:
QString(wellformed_unicode_text).toUnicode() == wellformed_unicode_text
Consequently we have to do conversion each time we need the renderable text, and/or cache the results to avoid converting repeatedly.  Right?
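
To make that trade-off concrete, here is a minimal sketch in plain C++ (no Qt), under loud assumptions: composeToy() and CachedText are hypothetical names invented for this illustration, and the composer knows exactly one rule ('e' followed by U+0301 becomes U+00E9).  Real normalization covers far more; in Qt the real call would be QString::normalized() with QString::NormalizationForm_C.  The sketch only shows that the composed form is a different sequence of code units, so the original has to be kept next to any cached composed copy if the input must be reproduced exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy composer: it knows exactly one rule,
// 'e' + U+0301 (combining acute accent) -> U+00E9.
std::u16string composeToy(const std::u16string &in)
{
    const char16_t combiningAcute = 0x0301;
    const char16_t eAcute = 0x00E9;
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == u'e' && i + 1 < in.size() && in[i + 1] == combiningAcute) {
            out.push_back(eAcute);
            ++i; // the combining mark has been consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Hypothetical holder: keeps the input untouched and caches the composed form.
struct CachedText {
    std::u16string original;
    std::u16string composed;

    explicit CachedText(const std::u16string &text)
        : original(text), composed(composeToy(text))
    {
        // This toy composition only merges code units, so it never grows the text.
        assert(composed.size() <= original.size());
    }
};

// Usage: the cached copy differs from the input, the stored original does not.
void cachedTextExample()
{
    std::u16string decomposed;
    decomposed.push_back(u'e');
    decomposed.push_back(char16_t(0x0301));

    const CachedText text(decomposed);
    assert(text.original == decomposed);  // exact reproduction is still possible
    assert(text.composed != decomposed);  // the composed copy alone would lose it
    assert(text.composed.size() == 1);
}
```

The cost is visible in the struct: two copies of the text, or one copy plus a conversion each time the composed form is needed.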

But it would also be possible to go the other direction: store only the UTF-8 form in memory, so by definition we can reproduce text that was given in UTF-8 form.  But if the QString was constructed from text in some other UTF form, can we simply remember which UTF it was, convert to UTF-8, and then be able to reproduce it exactly by converting back from UTF-8?  And we still need to be able to do conversion to renderable glyphs, and maybe cache them.
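
As far as the encodings themselves go, the round trip should work for well-formed text: UTF-8 and UTF-16 encode the same sequence of code points, so converting between them loses nothing.  Here is a minimal sketch in plain C++ (no Qt).  The function names are hypothetical, and the code assumes well-formed input; it does no validation, so malformed bytes or unpaired surrogates are outside what it demonstrates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one code point to a UTF-8 string (hypothetical helper).
void appendUtf8(std::string &out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode well-formed UTF-8 into UTF-16 code units.
std::u16string utf8ToUtf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else { // outside the BMP: split into a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Encode well-formed UTF-16 back into UTF-8.
std::string utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t low = in[i + 1];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i; // the low surrogate has been consumed
        }
        appendUtf8(out, cp);
    }
    return out;
}

// Usage: "a", U+00E9, U+20AC and U+1F600, written as raw UTF-8 bytes.
void roundTripExample()
{
    const std::string utf8 = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
    const std::u16string utf16 = utf8ToUtf16(utf8);
    assert(utf16.size() == 5);              // the last code point needs two units
    assert(utf16ToUtf8(utf16) == utf8);     // byte-for-byte identical
}
```

Remembering which UTF the text arrived in would then be one extra tag per string.  What such a tag cannot recover is anything beyond the code points themselves, for example malformed input that was repaired or replaced during conversion.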

I see that QTextBoundaryFinder has
     const QChar *chars;
so that is nearly a cache of glyphs, right?  Except that e.g. a soft hyphen could exist in that list, and yet may or may not be rendered depending on where the line breaks fall?  So in the end, rendering must still be done iteratively by calling toNextBoundary() repeatedly, pulling out substrings, and rendering those.  (QTextBoundaryFinder doesn’t have a QChar grapheme() accessor.  I guess that’s done elsewhere.)  But some of the decisions have been made already, and are embodied in that array of QChars.  I was wondering whether it’s worthwhile to do more work each time we iterate, by using the UTF-8 form directly, instead of converting to an array of QChars first.  The memory needed to store the string would be less, but the code to do the glyph-by-glyph iteration at rendering time would become more “branchy”, and that is also bad for CPU cache performance.
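
A small sketch of the iteration side of that question, in plain C++ (no Qt), with hypothetical function names and assuming well-formed input.  It only counts code points, which is far less than finding grapheme boundaries, but it shows two things: UTF-16 is variable-length too (surrogate pairs), so an array of 16-bit units also needs a per-unit test; and for simple scans the test can be written as arithmetic instead of a branch.  Whether real shaping code could stay that branch-free is not something this sketch claims.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t countCodePointsUtf8(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char byte : s)
        n += (byte & 0xC0) != 0x80;
    return n;
}

// Count code points in well-formed UTF-16: every unit that is not a
// low surrogate (0xDC00..0xDFFF) starts a new code point.
std::size_t countCodePointsUtf16(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t unit : s)
        n += (unit & 0xFC00) != 0xDC00;
    return n;
}

// Usage: the same four code points ("a", U+00E9, U+20AC, U+1F600) in both forms.
void countingExample()
{
    const std::string utf8 = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
    std::u16string utf16;
    utf16.push_back(u'a');
    utf16.push_back(char16_t(0x00E9));
    utf16.push_back(char16_t(0x20AC));
    utf16.push_back(char16_t(0xD83D)); // high surrogate of U+1F600
    utf16.push_back(char16_t(0xDE00)); // low surrogate of U+1F600

    assert(countCodePointsUtf8(utf8) == 4);
    assert(countCodePointsUtf16(utf16) == 4);

    // Storage: this mixed text takes 10 bytes either way; pure ASCII halves in UTF-8.
    assert(utf8.size() == 10);
    assert(utf16.size() * sizeof(char16_t) == 10);
}
```

So the memory saving depends heavily on the script: ASCII-heavy text shrinks by half in UTF-8, while text dominated by three-byte sequences would grow.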

Oh, but there’s another way of storing glyphs: the list of QScriptItems in the text engine.  That looks kind of bulky too, depending on how long we keep it around.

So Unicode is a mini-language which has to be interpreted at some point on the way to rendering; there’s no pre-interpreted form we could store it in.  TrueType is also a mini-language.  Maybe it would be possible to write a compiler which reads UTF-8 and TrueType and writes (nearly) branch-free code to render a whole line or block of text, so we could cache code instead of data.  It could be more compact and CPU cache-friendly.  I imagine nobody has done that yet.  But then if you think about all the fancy stuff TeX can do, it could get even more complex than what Qt currently does.  And I don’t understand much about what HarfBuzz does yet, either.