[Development] Why can't QString use UTF-8 internally?

Bo Thorsen bo at vikingsoft.eu
Wed Feb 11 10:32:31 CET 2015


On 10-02-2015 at 23:17, Allan Sandfeld Jensen wrote:
> On Tuesday 10 February 2015, Oswald Buddenhagen wrote:
>> On Wed, Feb 11, 2015 at 12:37:41AM +0400, Konstantin Ritt wrote:
>>> Yes, that would be an ideal solution. Unfortunately, that would also
>>> break a LOT of existing code.
>>
>> i was thinking of making it explicit with a smooth migration path - add
>> QUtf8String (basically QByteArray, but don't permit implicit conversion
>> to avoid encoding mistakes) and QUcs4String (and QUtf16String as an
>> alias for current QString - for all the windows function calls). the
>> main effort would be adding respective overloads to all our api. then
>> deprecate QString, and prune it in qt6. then maybe re-add it as an
>> alias for utf8string a few minor versions down. does that sound
>> feasible?
>>
> Maybe with C++11 we don't need QString that much anymore. Use std::string with
> UTF8 and std::u32string for UCS4.

This would make me very unhappy. I'm doing a customer project right now 
that uses std::string all over the place, and there is real pain 
involved in it. std::string is an almost empty layer over char* and 
brings none of the features of QString. Of all the failures of the C++ 
standards committee, std::string is the worst.

Any string class has to be Unicode. What it uses internally is an 
implementation detail (which is what started this thread). It's fine to 
have a pure ASCII string type as well, but there are very few cases 
left in real-world applications where that is useful.

What QString uses internally is purely an optimization question, and 
I'll leave that to others. But whatever is decided, I want to be sure 
it keeps some of the things QString offers:

1) Unicode! Don't assume the user remembers to use UTF-8. 
qlabel->setText(stdString) *will* fail. Leaving encoding decisions to 
users is a bad idea.

2) length() returns the number of characters I see on the screen, not 
a random implementation detail of the chosen encoding.

3) at(int) and operator[] give the Unicode character, not a byte in 
whatever encoding was chosen.

std::string fails at these completely basic requirements, which is why 
you will never see me use it, unless some customer API demands it or 
I'm in one of those exceptional cases where the strings are guaranteed 
to be pure ASCII.
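
A small sketch of what 2) and 3) mean in practice - my own example, 
assuming Qt 5 and UTF-8 encoded input:

#include <QString>
#include <string>
#include <iostream>

int main()
{
    // The UTF-8 bytes for "café": the 'é' alone takes two bytes.
    std::string raw = "caf\xC3\xA9";

    // std::string counts bytes, not characters:
    std::cout << raw.size() << "\n";                 // 5
    // ...and indexing gives a raw UTF-8 byte:
    std::cout << int((unsigned char)raw[3]) << "\n"; // 195 (0xC3)

    // QString counts QChars, one per code point here:
    QString s = QString::fromUtf8(raw.c_str());
    std::cout << s.length() << "\n";                 // 4
    // ...and at() gives the Unicode code point of 'é':
    std::cout << s.at(3).unicode() << "\n";          // 233 (0xE9)

    return 0;
}

Hand the std::string numbers to any code that thinks in characters and 
you get off-by-N bugs for every non-ASCII user.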

Another note: Latin1 is the worst idea for i18n ever invented. By now 
it is useless, irrelevant and only a source of bugs once you start to 
truly support i18n outside the USA and Western Europe. I would be one 
step closer to total happiness if C++17 and Qt 7 made this "encoding" 
completely unsupported.
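
Here is the kind of bug it causes - again my own minimal example, 
assuming Qt 5:

#include <QString>
#include <QDebug>

int main()
{
    // The UTF-8 bytes for "café": 'é' is the two bytes 0xC3 0xA9.
    const char *bytes = "caf\xC3\xA9";

    qDebug() << QString::fromUtf8(bytes);   // "café" - correct
    qDebug() << QString::fromLatin1(bytes); // "cafÃ©" - silent mojibake

    return 0;
}

No crash, no warning - the user just sees garbage. That is exactly the 
class of bug I keep finding in customer code.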

I know I've made the statements here a bit harsh, but I see the same 
kinds of problems again and again in customer code when they choose to 
use std::string all over the place. They give the same arguments I've 
seen here - optimized, faster, etc. - and add a few like "easier to 
switch away from Qt", "the backend is std/boost only, no Qt allowed", 
and so on. And they pay for it in development time, bug fixing and 
angry users.

Sure, QString isn't optimized for some cases. But I'll take a less 
optimized class any day over something that brings heaps of bugs. Then 
I have time to focus on optimizing the things that actually matter 
instead of fixing bugs.

Bo Thorsen,
Director, Viking Software.

-- 
Viking Software
Qt and C++ developers for hire
http://www.vikingsoft.eu


