[Development] RFC: Proposal for a semi-radical change in Qt APIs taking strings

Thu Oct 15 00:27:14 CEST 2015

On Wednesday 14 October 2015 21:51:23 Bubke Marco wrote:
> On October 14, 2015 23:10:26 Thiago Macieira <thiago.macieira at intel.com>
> wrote:
> > Do it on your own. You just said that ICU has the function you want, so
> > use it.
> 
> So Qt is always shipping with ICU?

It can be disabled on Windows. On OS X there's no point since it's part of the 
system. On Linux, if you disable it, you're going to have some other features 
reduced, so don't disable it.

> > Qt does not have to provide a comparator that operates on something other
> > than its native string type.
> 
> Isn't Qt a framework to help developers? Sorry, your argumentation does not
> sound very empirical.

Yes, it is. But Qt's goal is not to support every single use-case and
corner-case out there. Qt should make 90% easy and 9% possible. That means
there's 1% of the realm of possibilities that Qt does not address. If your
use-case falls into this group, use the fact that Qt is native code and just
call other libraries.

That's one of the two main advantages of native code. There's no sandbox to 
escape from.

Qt already supports doing locale-aware comparison. We even have a class for
it, so it can be done efficiently: QCollator, and it supports our native string
type (QString).

Providing extra support for a character encoding that is not what QString uses 
falls in that 1%. Just use ICU.

> >> Maybe Windows and Mac OS will bring support to the standard library so we
> >> don't need it, but in the meantime it would be very helpful.
> >> 
> >> A UTF-8 based QTextDocument would maybe be nice too.
> > 
> > What for? It needs to keep a lot of extra structures, so the cost of
> > conversion and extra memory is minimal. And besides, QTextDocument really
> > needs a seekable string, not UTF-8.
> 
> Is UTF-16 seekable? You still have surrogates and you can merge code
> points.

Seekable enough. It's much easier to deal with than UTF-8. A surrogate pair, 
as its name says, appears *only* in pairs, so you always know if you're on the 
first or on the second. Moreover, all living languages are encoded in the Basic 
Multilingual Plane, so no surrogate pairs are required for any of them. 
Handling of surrogate pairs can be moved to non-critical codepaths.
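
To illustrate the point about always knowing which half of a pair you are on,
here is a minimal sketch in plain standard C++ (not Qt code). The helper names
alignToCodePoint and codePointAt are made up for this example, and the input is
assumed to be well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A code unit in 0xD800..0xDBFF is the first (high) half of a surrogate pair.
constexpr bool isHighSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
// A code unit in 0xDC00..0xDFFF is the second (low) half of a surrogate pair.
constexpr bool isLowSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Hypothetical helper: move an arbitrary index back to the start of the code
// point it falls in. Looking at one neighbour is enough; no scan is needed.
std::size_t alignToCodePoint(const std::u16string &s, std::size_t pos)
{
    assert(pos <= s.size());
    if (pos > 0 && pos < s.size()
        && isLowSurrogate(s[pos]) && isHighSurrogate(s[pos - 1]))
        return pos - 1;
    return pos;
}

// Hypothetical helper: decode the code point that starts at an aligned index.
char32_t codePointAt(const std::u16string &s, std::size_t pos)
{
    assert(pos < s.size());
    const char16_t c = s[pos];
    if (isHighSurrogate(c) && pos + 1 < s.size() && isLowSurrogate(s[pos + 1]))
        return 0x10000 + ((char32_t(c) - 0xD800) << 10)
                       + (char32_t(s[pos + 1]) - 0xDC00);
    return c; // BMP character (or an unpaired surrogate, returned as-is)
}
```

The surrogate branch is only taken for characters outside the BMP, which is
why that handling can live off the critical path.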

As for combining code points, that's something different and usually one or
more layers removed from the seeking, alongside zero- and full-width code
points. QTextDocument also handles fonts with variable-width glyphs, so you
can never simply convert a byte index to a pixel position just like that. (Not
to mention those pesky line breaks...)

> Let's describe an example. I send the QTextDocument content to a library
> which expects UTF-8 content and gives me back positions. This gets
> interesting if you use non-ASCII characters. Actually the new clang code model
> works that way.

That example shows how UTF-16 is better. See above on seekability of UTF-16 vs 
UTF-8.
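
For readers who want to see what the position mismatch in the quoted example
looks like in practice, here is a rough sketch in plain standard C++ of mapping
a UTF-8 byte offset back to a UTF-16 index. The function name
utf16IndexFromUtf8Offset is hypothetical, and the sketch assumes valid UTF-8
and an offset that lies on a code point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given valid UTF-8 text and a byte offset on a code
// point boundary, return the index of the same position in the UTF-16
// encoding of that text. The prefix has to be scanned, so this is O(offset).
std::size_t utf16IndexFromUtf8Offset(const std::string &utf8,
                                     std::size_t byteOffset)
{
    assert(byteOffset <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80)
            continue;            // continuation byte: counted with its lead
        units += (b >= 0xF0) ? 2 // 4-byte sequence: surrogate pair in UTF-16
                             : 1; // 1- to 3-byte sequence: one UTF-16 unit
    }
    return units;
}
```

For pure ASCII the two indices coincide; they diverge as soon as the text
contains a multi-byte sequence, and every lookup then needs a scan like this
one.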

The solution for this is to fix the library to accept UTF-16. When we were 
doing Qt 5.0, we needed PCRE to support UTF-16. Their developers were very 
welcoming and wrote the version that supports UTF-16, so Qt does not need to 
reallocate.

> > Even if we provide UTF-8 support classes, those will not propagate to the
> > GUI. Forget it.
> 
> What about compressing UTF-16 like Python does for UTF-32? If you are
> only using ASCII you set a flag and you can remove all those useless zeros.
> It would have implications for data(), but maybe we should not provide
> access to the internal representation. If you use UTF-32 as a base you
> don't need surrogates anymore.

That's what Lars called a "hybrid solution" and vetoed. I second that.

Way too much code would break if we did that because we allow people access to 
the data pointer in QString and to iterate directly (std::{,w,u16}string don't 
allow that, which makes parsing them actually a lot more cumbersome).

As for UTF-32/UCS-4, it occupies twice as much space as it needs for all text 
written with living languages.
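
The size argument can be checked with a few lines of plain standard C++. This
is only a sketch: the function names utf16Bytes and utf32Bytes are invented for
the example, and the input is assumed to hold valid Unicode code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store the text as UTF-16: one 16-bit unit per BMP code
// point, two units (a surrogate pair) for anything above U+FFFF.
std::size_t utf16Bytes(const std::u32string &text)
{
    std::size_t units = 0;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);
        units += (cp >= 0x10000) ? 2 : 1;
    }
    return units * sizeof(char16_t);
}

// Bytes needed to store the text as UTF-32: always one 32-bit unit per
// code point.
std::size_t utf32Bytes(const std::u32string &text)
{
    return text.size() * sizeof(char32_t);
}
```

For text made up only of BMP characters, utf32Bytes returns exactly twice what
utf16Bytes does; the two only meet when every character needs a surrogate
pair.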

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center