[Development] Why can't QString use UTF-8 internally?

Thu Feb 12 12:29:59 CET 2015

On 12/02/15 10:11, "Rutledge Shawn" <Shawn.Rutledge at theqtcompany.com>
wrote:

>
>On 12 Feb 2015, at 08:55, Konstantin Ritt <ritt.ks at gmail.com> wrote:
>
>> 2015-02-12 11:53 GMT+04:00 Konstantin Ritt <ritt.ks at gmail.com>:
>> 2015-02-12 11:39 GMT+04:00 Rutledge Shawn
>><Shawn.Rutledge at theqtcompany.com>:
>> 
>> On 11 Feb 2015, at 18:15, Konstantin Ritt <ritt.ks at gmail.com> wrote:
>> 
>> > FYI: Unicode codepoint != character visual representation. Moreover,
>>a single character could be represented with a sequence of glyphs or
>>vice versa - a sequence of characters could be represented with a single
>>glyph.
>> > QString (and every other Unicode string class in the world)
>>represents a sequence of Unicode codepoints (in this or that UTF), not
>>characters or glyphs - always remember that!
>> 
>> Is it impossible to convert some of the possible multi-codepoint
>>sequences into single ones, or is it just that we prefer to preserve
>>them so that when you convert back to UTF you get the same bytes with
>>which you created the QString?
>> 
>> Not sure I understand your question in context of visual representation.
>> Assume you're talking about composing the input string (though the same
>>string, composed and decomposed, would be shaped into the same sequence
>>of glyphs).
>> A while ago we decided to not change the composition form of the input
>>text and let the user (de)compose where they need a fixed composition
>>form, so that QString(wellformed_unicode_text).toUnicode() ==
>>wellformed_unicode_text.
>> 
>> P.S. We could re-consider this or could introduce a macro that would
>>change the composition form of a QString input but…why?
>
>It might be almost within our power to index into a QString and get a
>single, complete, renderable glyph, which is practical at least for
>rendering, and maybe for editing too.  But if we did that by storing the
>Unicode that way, we’d lose this feature of being able to reproduce the
>input text exactly:
>QString(wellformed_unicode_text).toUnicode() == wellformed_unicode_text
>Consequently we have to do conversion each time we need the renderable
>text, and/or cache the results to avoid converting repeatedly.  Right?

The TrueType/OpenType font defines how to convert a Unicode string into
its visual representation. That representation can be very different from
font to font, and there simply is no predefined mapping from Unicode text
to glyphs. Not even for Latin text, given things such as optional ff/fi
ligatures or predefined combined glyphs for certain character combinations.

I believe QString is about the best abstraction we could make for a
Unicode string. Of course one can debate the concrete encoding of
Unicode we chose (UTF-16 vs UTF-8 vs UTF-32), but it’s currently a purely
academic exercise. All have advantages and disadvantages, but the fact
remains for all of them that one Unicode code point does not necessarily
represent what you would intuitively think of as one character in
your writing system. Some other writing systems are word based (e.g.
Chinese) or syllable based (e.g. parts of Japanese, Indian languages).
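
To make the point about encodings concrete, here is a minimal sketch in
plain standard C++ (deliberately not QString or any other Qt API). The
names utf8Units, utf16Units and countUnits are made up for this
illustration. It counts how many code units the same text needs in each
encoding, assuming the input is a sequence of valid Unicode scalar values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed for one code point.
// Assumes cp is a valid Unicode scalar value (not a surrogate).
std::size_t utf8Units(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) return 1;      // ASCII range
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

// Hypothetical helper: number of UTF-16 code units for one code point.
// Code points above the BMP need a surrogate pair.
std::size_t utf16Units(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    return cp < 0x10000 ? 1 : 2;
}

// Code unit counts for the same text in the three encodings.
struct UnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// In UTF-32 every code point is exactly one unit, so the count is
// simply the length of the code point sequence.
UnitCounts countUnits(const std::u32string &text) {
    UnitCounts counts{0, 0, text.size()};
    for (char32_t cp : text) {
        counts.utf8 += utf8Units(cp);
        counts.utf16 += utf16Units(cp);
    }
    return counts;
}
```

With these helpers, countUnits(U"\u00E9") (precomposed "é") gives 2/1/1
units for UTF-8/UTF-16/UTF-32, countUnits(U"e\u0301") (the same character
written as "e" plus a combining acute accent) gives 3/2/2, and
countUnits(U"\U0001F600") gives 4/2/1. Even in UTF-32 the decomposed "é"
is two units, so no choice of encoding makes one unit equal one character.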

There’s a reason the Unicode specification (without data tables and
appendices) has almost a thousand pages ;-)

>But it would also be possible to go the other direction: save only UTF-8
>form in memory, so by definition we can reproduce text that was given in
>UTF-8 form.  But if the QString was constructed from text in some other
>UTF form, can we simply remember which UTF it was, convert to UTF-8, and
>then be able to reproduce it exactly by converting back from UTF-8?  And
>we still need to be able to do conversion to renderable glyphs, and maybe
>cache them.
>
>I see that QTextBoundaryFinder has
>     const QChar *chars;
>so that is nearly a cache of glyphs, right?  except that e.g. a soft
>hyphen could exist in that list, and yet may or may not be rendered
>depending where the line breaks fall?  So at the end, rendering must
>still be done iteratively by calling toNextBoundary repeatedly and
>pulling out substrings and rendering those.  (QTextBoundaryFinder doesn’t
>have a QChar grapheme() accessor.  I guess that’s done elsewhere.)  But
>some of the decisions have been made already, and embodied into that
>array of QChars.  I was wondering whether it’s worthwhile to do more work
>each time we iterate, by using UTF-8 form directly, instead of converting
>to an array of QChars first.  So the memory to store the string would be
>less, but the code to do the glyph-by-glyph iteration at rendering time
>would become more “branchy”, and that is also bad for CPU cache
>performance.
>
>Oh but there’s another way of storing glyphs: the list of QScriptItems in
>the text engine.  That looks kind of bulky too, depending how long we keep
>it around.
>
>So Unicode is a mini-language which has to be interpreted at some point
>on the way to rendering; there’s no pre-interpreted form we could store
>it in.  TrueType is also a mini-language.  Maybe it would be possible to
>write a compiler which reads UTF-8 and TrueType and writes (nearly)
>branch-free code to render a whole line or block of text, so we could
>cache code instead of data.  It could be more compact and CPU
>cache-friendly.  I imagine nobody has done that yet.  But then if you
>think about all the fancy stuff TeX can do, it could get even more
>complex than what Qt currently does.  And I don’t understand much about
>what HarfBuzz does yet, either.

See Konstantin’s answer. It’s neither possible nor desirable. And there’s
no mapping back from glyphs to Unicode.

Lars