[Development] Why can't QString use UTF-8 internally?
Rutledge Shawn
Shawn.Rutledge at theqtcompany.com
Tue Feb 10 18:55:59 CET 2015
On Feb 10, 2015, at 17:08, Julien Blanc <julien.blanc at nmc-company.com> wrote:
> On 10/02/2015 16:33, Knoll Lars wrote:
>> IMO there’s simply too many questions that this one example doesn’t answer
>> to conclude that what we are doing is bad.
>
> Two arguments :
> - implicit sharing is convenient, and really developer friendly. It is
> probably a good idea since strings appear a lot in signals
> and slots (and afaik are passed by value in these contexts)
> - implicit sharing is implicit, you don’t have the choice not to pay for
> it, which is a bad thing.
>
> From my experience, QStrings are slow. About two times slower than
> using plain std::string in our use cases, but the culprit for this
> slowness is, as far as we know, the internal 16 bits encoding, whereas
> our data sources are all using utf-8. We have no evidence that the
> implicit sharing cost is significant or not.
Should we try to use UTF-8 in some future version of Qt? I’ve wondered about that for a while. 16 bits is not enough to represent every Unicode code point, whereas 32 bits would be; and yet 8 bits is enough most of the time. Isn’t 16 bits the worst choice then? (Some bloat for European languages, and algorithmic inefficiency for text outside the BMP, which needs surrogate pairs.)

With 32-bit characters, operator[] is always O(1). With UTF-8, the code would often have to step over a variable number of bytes to get to the next character. But is that cost worth paying to save memory? Given the earlier point that operations on data which fits entirely in cache are so much faster that this swamps the O(whatever) efficiency of some algorithms, keeping strings as small as possible should be a good thing.

And maybe there are some clever tricks to get faster character indexing: bitfields, binary search, or an occasional weak reference to a 32-bit decoded version when it’s really needed. Emphasizing iterators over operator[] would help too.
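
To make the indexing trade-off concrete, here is a rough sketch in plain C++ (std::string only, nothing from Qt). All of the function names below are made up for illustration and are not part of any Qt or standard API, and the code assumes well-formed UTF-8 with no validation. It shows that reaching the n-th character in UTF-8 means walking the bytes from the start, and that UTF-16 is not fixed-width either.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence starting at lead byte b.
// Assumes well-formed input; a stray continuation byte is treated as length 1.
inline std::size_t utf8SequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((b >> 5) == 0x06) return 2;  // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Number of code points: one pass over the bytes, counting every byte
// that is not a continuation byte (10xxxxxx).
inline std::size_t utf8CodePointCount(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

// Byte offset of the code point at position `index`. This is the O(n) walk
// that a UTF-8 operator[] would need; returns s.size() if index is out of range.
inline std::size_t utf8ByteOffset(const std::string &s, std::size_t index)
{
    std::size_t pos = 0;
    while (index > 0 && pos < s.size()) {
        pos += utf8SequenceLength(static_cast<unsigned char>(s[pos]));
        --index;
    }
    return pos < s.size() ? pos : s.size();
}

// Decode the code point whose lead byte is at `pos` (requires pos < s.size()).
inline char32_t utf8DecodeAt(const std::string &s, std::size_t pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    const std::size_t len = utf8SequenceLength(lead);
    if (len == 1)
        return lead;
    // Keep the payload bits of the lead byte: 5, 4 or 3 bits for len 2, 3, 4.
    char32_t cp = lead & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len && pos + i < s.size(); ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return cp;
}

// How many 16-bit units a code point needs in UTF-16: anything above
// U+FFFF takes a surrogate pair, so 16-bit storage is variable-width too.
inline std::size_t utf16UnitsFor(char32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}
```

For the 10-byte string "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80" (a, é, €, and an emoji), utf8ByteOffset(s, 3) has to step over three earlier characters to land on byte 6. An iterator that just keeps the current byte offset avoids repeating that walk on every access, which is the reason for preferring iterators over operator[].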
I googled "utf-8 character indexing" and the top hit was this (which I’ve probably seen before): http://utf8everywhere.org/