[Development] Qt6: Adding UTF-8 storage support to QString

Jason H jhihn at gmx.com
Wed Jan 23 16:25:44 CET 2019


> From: "Arnaud Clère" <arnaud.clere at minmaxmedical.com>
> > And I don't want to add QUtf8String until SG16's char8_t gets settled. It'll probably be settled by C++20, which means we can probably work on this during Qt 6 lifetime, possibly even 6.1 or 6.2.
> 
> It makes sense to avoid future incompatibilities with the standard but fortunately Qt sometimes chooses to solve real problems ahead in time  ;-)

Well, how many months away is C++20, really? And when will Qt 6 be released? It seems both might land at about the same time, except that the "by C++20" part is (AFAICT) speculation. Uptake will also be slow. But if Qt goes first, we get experience with the nature of the solution, which might help inform the standard, or vice versa. The risk is that we do something useful that people like but that conflicts with the standard, and then we have fragmentation.

Far smarter people than I have worked on this, so again, burn this with fire, but my current thinking is:
The problem is how all these encodings are implemented: they are basically escape codes, so you can't say where the current character ends and the next begins without decoding the bytes. This of course kills speed, but that's what we get for having more than one language on the planet, plus emojis. It seems to me that the only real solution that keeps it all fast is to progressively upgrade from bytes to the widest character present and use that width throughout. That incurs a scanning cost when the data enters the address space, unless the width is already declared to the compiler or known to the load method. If memory is a concern, the only alternative I see is a complex string: "strings" become arrays of character arrays, each of uniform width, and we hope there is only ever one such array:
"Ground control to Major Tom" - single sequence of 8 bit chars, len 27 size 27
"niños." encoded as 3 "strings", total length 6, size 7:
+ "ni" - "ni" (8 bit char sequence of 2 char)
+ "ñ" - 00000000 11110001 (UTF16 16 bit char sequence of 1 char)
+ "os." - "os." (8 bit char sequence of 3 char)
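
For illustration only, here is a minimal C++ sketch of that "complex string". Every name in it (Run, SegmentedString, widthOf) is hypothetical and none of it is Qt API. It assumes the input is already a sequence of valid code points, and it treats only ASCII as 8 bit so that "ñ" lands in a 16 bit run, as in the example above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One run of characters that all share the same storage width.
struct Run {
    unsigned width;                   // bytes per character: 1, 2 or 4
    std::vector<std::uint8_t> bytes;  // characters stored back to back
    std::size_t length() const { return bytes.size() / width; }
};

// Assumption: ASCII fits in 1 byte, the rest of the BMP in 2, anything else in 4.
inline unsigned widthOf(char32_t c) {
    return c <= 0x7F ? 1u : c <= 0xFFFF ? 2u : 4u;
}

class SegmentedString {
public:
    explicit SegmentedString(const std::u32string &text) {
        for (char32_t c : text) {
            const unsigned w = widthOf(c);
            // Start a new run whenever the required width changes.
            if (runs_.empty() || runs_.back().width != w)
                runs_.push_back(Run{w, {}});
            // Store the code point big-endian in exactly w bytes.
            for (unsigned i = w; i-- > 0;)
                runs_.back().bytes.push_back(
                    static_cast<std::uint8_t>((c >> (8 * i)) & 0xFF));
        }
    }

    std::size_t runCount() const { return runs_.size(); }

    // Number of characters across all runs.
    std::size_t length() const {
        std::size_t n = 0;
        for (const Run &r : runs_) n += r.length();
        return n;
    }

    // Number of payload bytes across all runs (bookkeeping not counted).
    std::size_t size() const {
        std::size_t n = 0;
        for (const Run &r : runs_) n += r.bytes.size();
        return n;
    }

    // Indexing walks the runs; inside a run it is plain array arithmetic.
    char32_t at(std::size_t index) const {
        for (const Run &r : runs_) {
            const std::size_t n = r.length();
            if (index < n) {
                char32_t c = 0;
                for (unsigned i = 0; i < r.width; ++i)
                    c = (c << 8) | r.bytes[index * r.width + i];
                return c;
            }
            index -= n;
        }
        return 0;  // out of range
    }

private:
    std::vector<Run> runs_;
};
```

The tradeoff shows up in at(): lookup is only constant time while the string stays a single run, which is the "hope that it is only ever one" case. The per-run bookkeeping also costs memory that the sizes above do not count.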

Over 20 years ago I read, in Dr. Dobb's or some other print medium, that one of the old BASICs (I forget which) stored strings as a linked list of characters. I'm adapting that idea. There are many tradeoffs, but until we're OK with 32 bit characters, there will be tradeoffs on a multi-language planet.

I just don't think escape codes should ever be stored in memory. Disk is fine. 
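
A rough C++ sketch of the first option, widening at load time: decode the UTF-8 once when it comes in, then pick the narrowest uniform width that holds every character. The helper names decodeUtf8 and uniformWidth are made up for this mail. The decoder assumes well-formed UTF-8 and does no validation, and as above only ASCII counts as 8 bit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 (assumed valid) into one code point per element.
std::u32string decodeUtf8(const std::string &in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        // The lead byte says how many continuation bytes follow.
        const unsigned extra = b < 0x80 ? 0u : b < 0xE0 ? 1u : b < 0xF0 ? 2u : 3u;
        // Keep only the payload bits of the lead byte.
        char32_t c = extra == 0 ? b : static_cast<char32_t>(b & (0x3F >> extra));
        for (unsigned k = 1; k <= extra && i + k < in.size(); ++k)
            c = (c << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(c);
        i += extra + 1;
    }
    return out;
}

// Bytes per character needed to store the whole text at one uniform width.
// This is the scanning cost paid once when the data enters memory.
unsigned uniformWidth(const std::u32string &text) {
    unsigned w = 1;
    for (char32_t c : text) {
        if (c > 0xFFFF) return 4;  // cannot get any wider, stop scanning
        if (c > 0x7F) w = 2;
    }
    return w;
}
```

With this, "niños." needs 2 bytes per character (12 bytes for 6 characters instead of 7), and a single emoji pushes the whole string to 4 bytes per character. That is the memory cost being traded for constant-time indexing.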

"Better to remain silent and be thought a fool than to speak and to remove all doubt." - (Disputed). I think I may have broken that rule here. "Please, be gentle." - Peter Venkman



