[Development] RFC: Proposal for a semi-radical change in Qt APIs taking strings
Thiago Macieira
thiago.macieira at intel.com
Thu Oct 15 00:27:14 CEST 2015
On Wednesday 14 October 2015 21:51:23 Bubke Marco wrote:
> On October 14, 2015 23:10:26 Thiago Macieira <thiago.macieira at intel.com> wrote:
> > Do it on your own. You just said that ICU has the function you want, so
> > use it.
>
> So Qt is always shipping with ICU?
It can be disabled on Windows. On OS X there's no point since it's part of the
system. On Linux, if you disable it, you're going to have some other features
reduced, so don't disable it.
> > Qt does not have to provide a comparator that operates on something other
> > than its native string type.
>
> Isn't Qt a framework to help developers? Sorry, your argumentation does not
> sound very empirical.
Yes, it is. But Qt's goal is not to support every single use-case and
corner-case out there. Qt should make 90% easy and 9% possible. That means
there's a 1% of the realm of possibilities that Qt does not address. If your
use-case falls into this group, use the fact that Qt is native code and just
call other libraries.
That's one of the two main advantages of native code. There's no sandbox to
escape from.
Qt already supports locale-aware comparison. We even have a class for it, so
it can be done efficiently: QCollator, which supports our native string type
(QString).
Providing extra support for a character encoding that is not what QString uses
falls in that 1%. Just use ICU.
> >> Maybe windows and mac os will bring support to the standard library so we
> >> don't need it but in the mean time it would be very helpful.
> >>
> >> A utf 8 based QTextDocument would be maybe nice too.
> >
> > What for? It needs to keep a lot of extra structures, so the cost of
> > conversion and extra memory is minimal. And besides, QTextDocument really
> > needs a seekable string, not UTF-8.
>
> Is UTF 16 seekable? You still have surrogates and you can merge code
> points.
Seekable enough. It's much easier to deal with than UTF-8. A surrogate pair,
as its name says, appears *only* in pairs, so you always know if you're on the
first or on the second. Moreover, all living languages are encoded in the Basic
Multilingual Plane, so no surrogate pairs are required for any of them.
Handling of surrogate pairs can be moved to non-critical codepaths.
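
To make the seekability argument concrete, here is a minimal sketch in plain standard C++ (no Qt). The helper names alignUtf16 and alignUtf8 are hypothetical, chosen for illustration only, and both assume well-formed input. Landing on an arbitrary UTF-16 index needs at most one step back, while UTF-8 may need up to three.

```cpp
#include <cstddef>
#include <string>

// A UTF-16 code unit is a high surrogate in U+D800..U+DBFF
// and a low surrogate in U+DC00..U+DFFF.
constexpr bool isHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
constexpr bool isLowSurrogate(char16_t u)  { return (u & 0xFC00) == 0xDC00; }

// Hypothetical helper: move index i back to the start of the code point
// it falls into. In UTF-16 that is at most one step.
std::size_t alignUtf16(const std::u16string &s, std::size_t i)
{
    if (i > 0 && i < s.size() && isLowSurrogate(s[i]) && isHighSurrogate(s[i - 1]))
        return i - 1;
    return i;
}

// Hypothetical UTF-8 counterpart: skip back over continuation bytes
// (bit pattern 10xxxxxx), which can take up to three steps.
std::size_t alignUtf8(const std::string &s, std::size_t i)
{
    std::size_t steps = 0;
    while (i > 0 && i < s.size() && steps < 3
           && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
        --i;
        ++steps;
    }
    return i;
}
```

For text in the Basic Multilingual Plane the UTF-16 branch is never taken, which is why the surrogate check can live off the hot path.
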
As for combining code points, that's something different and usually one or
more layers removed from the seeking, alongside zero- and full-width code
points. QTextDocument also handles fonts with variable-width glyphs, so you
can never simply convert a byte index to a pixel position just like that (not
to mention those pesky line breaks...).
> Let's describe an example. I send the QTextDocument content to a library
> which expects UTF-8 content and gives me back positions. This gets
> interesting if you use non-ASCII characters. Actually the new clang code
> model works that way.
That example shows how UTF-16 is better. See above on seekability of UTF-16 vs
UTF-8.
The solution for this is to fix the library to accept UTF-16. When we were
doing Qt 5.0, we needed PCRE to support UTF-16. Their developers were very
welcoming and wrote the version that supports UTF-16, so Qt does not need to
reallocate.
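
Until such a library accepts UTF-16, the positions it reports have to be translated. The following is only a sketch in plain standard C++ of what that translation involves. The function name utf16IndexFromUtf8Offset is hypothetical, and the code assumes the input is valid UTF-8 and that the offset points at the start of a code point.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: translate a byte offset into UTF-8 text into the
// index of the same position in the UTF-16 encoding of that text.
// Assumes valid UTF-8; performs no error checking.
std::size_t utf16IndexFromUtf8Offset(const std::string &utf8, std::size_t byteOffset)
{
    std::size_t units = 0;
    const std::size_t end = std::min(byteOffset, utf8.size());
    for (std::size_t i = 0; i < end; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80)
            continue;                 // continuation byte: no new code point
        // A lead byte of 0xF0 or above starts a 4-byte sequence, i.e. a code
        // point outside the BMP, which takes two UTF-16 code units.
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

Note that this walks the text linearly from the start for every position, which is the cost a UTF-16 interface in the library would avoid.
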
> > Even if we provide UTF-8 support classes, those will not propagate to the
> > GUI. Forget it.
>
> What about compressing UTF 16 like Python does for UTF 32? If you are
> only using ASCII you set a flag and you can remove all those useless zeros.
> It would have implications for data() but maybe we should not provide
> access to the internal representation. If you use UTF 32 as a base you
> don't need surrogates anymore.
That's what Lars called a "hybrid solution" and vetoed. I second that.
Way too much code would break if we did that because we allow people access to
the data pointer in QString and to iterate directly (std::{,w,u16}string don't
allow that, which makes parsing them actually a lot more cumbersome).
As for UTF-32/UCS-4, it occupies twice as much space as it needs for all text
written with living languages.
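
The size argument can be checked with a few lines of plain standard C++. This is an illustrative sketch only; utf16ByteCount and utf32ByteCount are hypothetical names, and the input is assumed to hold valid code points.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store the text as UTF-32: four per code point.
std::size_t utf32ByteCount(const std::u32string &text)
{
    return text.size() * 4;
}

// Bytes needed to store the same text as UTF-16: two per code point in the
// BMP, four (a surrogate pair) for anything above U+FFFF.
std::size_t utf16ByteCount(const std::u32string &text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += (cp > 0xFFFF) ? 4 : 2;
    return bytes;
}
```

For text made only of BMP code points the UTF-32 figure is exactly double the UTF-16 one; the two only meet for code points that need a surrogate pair.
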
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center