[Development] Why can't QString use UTF-8 internally?

Tue Feb 10 22:26:50 CET 2015

On Wednesday 11 February 2015 00:37:41 Konstantin Ritt wrote:
> Yes, that would be an ideal solution. Unfortunately, that would also break
> a LOT of existing code.
> In Qt4 times, I was doing some experiments with the QString adaptive
> storage (similar to what NSString does behind the scenes).

I've thought of this too.

This stumbles on QString's implicit sharing. If you do this:

	QString foo = "some UTF-8 text";
	QString copy = foo;
	qDebug() << foo.constData()[0];

Then the last line invokes the const function constData(), which needs to
return UTF-16 data. If the original QString held only UTF-8 internally, it
couldn't do that, since it would have to write the converted data to memory
that is shared with copy.

It's not insurmountable. I can think of two solutions:
 1) pre-allocate enough space for the UTF-16 data (strlen(utf8) * 2 bytes), so
that the const functions can implicitly write to the UTF-16 block when needed.
Since the original UTF-8 data is constant, every thread would write the same
values, so provided there are no out-of-thin-air values, multiple threads
could safely do this operation simultaneously.

 2) indirect the UTF-16 data via an extra, atomic pointer. When a thread finds
it needs the UTF-16 data and doesn't have it, it allocates memory, does the
conversion, then publishes the result by calling testAndSetRelease on the
pointer (similar to Qt 4 Q_GLOBAL_STATIC).
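
To make option 2 concrete, here is a rough sketch of the lazy-publish pattern.
It is not QString code: SharedText, toUtf16 and utf16Length are hypothetical
names made up for illustration, std::atomic::compare_exchange_strong stands in
for QAtomicPointer::testAndSetRelease, and the decoder assumes well-formed
UTF-8. A thread that loses the race frees its own conversion and uses the
winner's.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <utility>

// Minimal UTF-8 -> UTF-16 decoder. Assumption: the input is well-formed UTF-8.
inline std::u16string toUtf16(const std::string &in)
{
    std::u16string out;
    out.reserve(in.size());            // UTF-16 never needs more units than UTF-8 bytes
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra;                     // number of continuation bytes to read
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (cp >= 0x10000) {           // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Hypothetical shared data block: UTF-8 is the primary storage, UTF-16 is
// created on demand and reached through an atomic pointer.
struct SharedText
{
    const std::string utf8;                                // never modified
    mutable std::atomic<std::u16string *> utf16{nullptr};  // lazily published

    explicit SharedText(std::string s) : utf8(std::move(s)) {}
    SharedText(const SharedText &) = delete;
    SharedText &operator=(const SharedText &) = delete;
    ~SharedText() { delete utf16.load(std::memory_order_acquire); }

    // Returns the cached UTF-16 copy, creating it on first use.
    const std::u16string &ensureUtf16() const
    {
        std::u16string *cached = utf16.load(std::memory_order_acquire);
        if (cached)
            return *cached;            // fast path: someone already converted

        std::u16string *fresh = new std::u16string(toUtf16(utf8));
        std::u16string *expected = nullptr;
        // Publish only if the pointer is still null (the testAndSet step).
        if (utf16.compare_exchange_strong(expected, fresh,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire))
            return *fresh;

        delete fresh;                  // lost the race: drop our copy
        return *expected;              // and use the winner's
    }

    // const, yet never writes to the shared UTF-8 bytes.
    const char16_t *constData() const { return ensureUtf16().data(); }
    std::size_t utf16Length() const { return ensureUtf16().size(); }
};
```

With this shape, every const call after the first one just loads the pointer,
and the only memory ever written after construction is the pointer itself. The
cost is one extra allocation per string that actually gets asked for UTF-16.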

I'd choose #2 for two reasons:
 a) it's closer to the QString I already have for Qt 6, which inlines the
"begin" pointer in the QString object itself.
 b) #1 actually has a bigger memory overhead than current solutions (the UTF-8
bytes plus the reserved UTF-16 block, versus the UTF-16 data alone today).

But given the choice, I would choose to do nothing. Instead, I have a patch 
pending for Qt 6 that caches the Latin1 version of the QString in an extra 
block past the UTF-16 data.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center