[Development] Why can't QString use UTF-8 internally?

Konstantin Ritt ritt.ks at gmail.com
Tue Feb 10 19:58:58 CET 2015


16 bits are entirely sufficient for most spoken languages (see Unicode's
Blocks.txt and/or Scripts.txt for an approximate list), whereas a single
8-bit code unit only covers ASCII.
Despite what http://utf8everywhere.org/#conclusions says, UTF-16 is not
the worst choice; it is a trade-off between performance and memory
consumption in the most common use case (spoken languages and mixed
scripts).
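
To put rough numbers on that trade-off, here is a minimal sketch in plain
standard C++ (no Qt) that counts how many bytes a sequence of code points
needs in each encoding form. The names utf8Bytes, utf16Bytes and encodedSize
are made up for this illustration and are not Qt API; the thresholds follow
the UTF-8 and UTF-16 encoding rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store one code point in UTF-8.
// Precondition: cp is within the Unicode range (hypothetical helper, not Qt API).
std::size_t utf8Bytes(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // e.g. Latin supplements, Greek, Cyrillic, Hebrew, Arabic
    if (cp < 0x10000) return 3;  // rest of the BMP, including CJK
    return 4;                    // supplementary planes
}

// Bytes needed to store one code point in UTF-16:
// one 16-bit code unit inside the BMP, a surrogate pair outside it.
std::size_t utf16Bytes(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSize {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total storage, in bytes, of the same text in the three encoding forms.
EncodedSize encodedSize(const std::u32string &text)
{
    EncodedSize size{0, 0, 0};
    for (char32_t cp : text) {
        size.utf8 += utf8Bytes(cp);
        size.utf16 += utf16Bytes(cp);
        size.utf32 += 4;  // UTF-32 is always one 32-bit unit per code point
    }
    return size;
}
```

Under these rules ASCII text is half the size in UTF-8, Cyrillic text is the
same size in UTF-8 and UTF-16, and CJK text is smaller in UTF-16 (2 bytes
per character instead of 3).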

Konstantin

2015-02-10 21:55 GMT+04:00 Rutledge Shawn <Shawn.Rutledge at theqtcompany.com>:

>
> On Feb 10, 2015, at 17:08, Julien Blanc <julien.blanc at nmc-company.com>
> wrote:
>
> > On 10/02/2015 16:33, Knoll Lars wrote:
> >> IMO there’s simply too many questions that this one example doesn’t
> >> answer to conclude that what we are doing is bad.
> >
> > Two arguments:
> > - implicit sharing is convenient, and really developer friendly. It is
> > probably a good idea since strings appear a lot in signals
> > and slots (and afaik, are passed by value in these contexts)
> > - implicit sharing is implicit, you don’t have the choice not to pay for
> > it, which is a bad thing.
> >
> > From my experience, QStrings are slow. About two times slower than
> > using plain std::string in our use cases, but the culprit for this
> > slowness is, as far as we know, the internal 16-bit encoding, whereas
> > our data sources are all using UTF-8. We have no evidence either way
> > on whether the implicit sharing cost is significant.
>
> Should we try to use UTF-8 in some future version of Qt?  I’ve wondered
> that for a while: 16 bits is not enough for every possible Unicode character,
> whereas 32 bits would be; and yet 8 bits is enough most of the time.  Isn’t
> 16 bits the worst choice then?  (some bloat for European languages, and
> algorithmic inefficiency for others)  With 32-bit characters, operator[] is
> always O(1).  If we use UTF-8, the code would often have to iterate a
> variable number of bytes to get to the next character.  But is it worth it
> to save memory?  Especially considering the point from earlier that
> operations on data which fit entirely within cache memory will be so much
> faster that it swamps the O(whatever) efficiency of some algorithms:
> keeping strings as small as possible should be a good thing.  And maybe
> there are some clever tricks to get faster character indexing, using
> bitfields or binary search or an occasional weak reference to a 32-bit
> decoded version when it’s really needed.  Emphasizing use of iterators
> instead of operator[] would help too.
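
The indexing cost described above can be sketched as follows, assuming
well-formed UTF-8 input and using only standard C++. The function names are
hypothetical and exist only for this illustration; none of them is Qt API.
Finding the n-th code point in UTF-8 means walking over lead and
continuation bytes from the start, whereas a 32-bit representation can
index directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool isContinuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Byte offset of the index-th code point in well-formed UTF-8, or
// utf8.size() if the string has fewer code points. Linear in the offset.
std::size_t byteOffsetOfCodePoint(const std::string &utf8, std::size_t index)
{
    std::size_t offset = 0;
    while (offset < utf8.size()) {
        if (index == 0)
            return offset;
        ++offset;  // step over the lead byte
        while (offset < utf8.size()
               && isContinuation(static_cast<unsigned char>(utf8[offset])))
            ++offset;  // step over its continuation bytes
        --index;
    }
    return utf8.size();
}

// Number of code points: every byte that is not a continuation byte starts one.
std::size_t codePointCount(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if (!isContinuation(byte))
            ++count;
    }
    return count;
}

// The UTF-32 counterpart: constant-time access to the index-th code point.
char32_t codePointAt(const std::u32string &utf32, std::size_t index)
{
    assert(index < utf32.size());
    return utf32[index];
}
```

Sequential iteration stays cheap in both forms, since each step only skips
the bytes of one code point; it is random access by character index that
becomes linear in UTF-8.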
>
> I googled "utf-8 character indexing" and the top hit was this (which I’ve
> probably seen before): http://utf8everywhere.org/