[Development] Status on QString's UTF-8 changes

Thiago Macieira thiago.macieira at intel.com
Wed May 2 16:26:42 CEST 2012


Context, for anyone reading this in isolation: we're changing the default 
encoding for QString's methods that deal with 8-bit data. In Qt 3 and 4, the 
encoding was variable and defaulted to Latin 1 (set with 
QTextCodec::setCodecForCStrings). In Qt 5, the variability was removed, 
leaving fromAscii == fromLatin1. We're NOT changing how QString stores data 
internally; that will remain UTF-16.
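
To make the difference concrete, here is a minimal sketch of why the choice of 
8-bit encoding matters for the same input bytes. It uses only the C++ standard 
library. decodeLatin1 and decodeUtf8 are hypothetical stand-ins for 
QString::fromLatin1 and QString::fromUtf8, not Qt's actual implementations, 
and the UTF-8 decoder is deliberately simplified (no surrogate pairs, no 
overlong-sequence checks).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for QString::fromLatin1: every byte becomes the
// UTF-16 code unit with the same numeric value.
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Hypothetical, simplified stand-in for QString::fromUtf8. It handles only
// 1-, 2- and 3-byte sequences (the Basic Multilingual Plane) and emits
// U+FFFD for any lead byte it cannot decode. It does not validate
// continuation bytes or reject overlong forms.
std::u16string decodeUtf8(const std::string &bytes)
{
    std::u16string out;
    std::size_t i = 0;
    // Read the byte at position n as an unsigned value.
    auto at = [&bytes](std::size_t n) {
        return static_cast<unsigned>(static_cast<unsigned char>(bytes[n]));
    };
    while (i < bytes.size()) {
        const unsigned c = at(i);
        if (c < 0x80) {
            // 7-bit ASCII: identical result in Latin 1 and UTF-8.
            out.push_back(static_cast<char16_t>(c));
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            // Two-byte sequence: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char16_t>(((c & 0x1F) << 6)
                                                | (at(i + 1) & 0x3F)));
            i += 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < bytes.size()) {
            // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char16_t>(((c & 0x0F) << 12)
                                                | ((at(i + 1) & 0x3F) << 6)
                                                | (at(i + 2) & 0x3F)));
            i += 3;
        } else {
            // Anything else is outside the scope of this sketch.
            out.push_back(static_cast<char16_t>(0xFFFD));
            i += 1;
        }
    }
    return out;
}
```

For pure 7-bit input such as "Qt" both functions return the same string, which 
is why most call sites are unaffected by the switch. For the two bytes 
0xC3 0xA9, the Latin 1 reading yields two characters (U+00C3, U+00A9) while the 
UTF-8 reading yields the single character U+00E9.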

A number of commits have been accepted into Qt 5 that dealt with the encoding 
of source files. I think I caught all source code that contained non-7-bit 
characters and reencoded them to UTF-8. There are surprisingly few in Qt. I've 
also wrapped all uses of the "ascii" functions that contained Latin1 data with 
QString::fromLatin1.

The following two pending commits change the QString 8-bit functions to use 
UTF-8, by *temporarily* changing fromAscii to mean fromUtf8, and toAscii to 
mean toUtf8.
	https://codereview.qt-project.org/24700
	https://codereview.qt-project.org/24701
	tests: https://codereview.qt-project.org/24702

They have been tested in qtbase and no regressions have been found. I do not 
believe they should cause regressions in other modules.

I'm now testing a series of changes that change fromAscii to fromUtf8, as well 
as correct one or two encoding mistakes I think I've found. Since fromAscii == 
fromUtf8 at this point in the test, the change is technically a no-op and I 
expect no regressions at all. Those changes are done for the few places in the 
code where the data seemed to be non-Latin1 in origin, as well as QString 
itself.

Next, I'll change all remaining fromAscii to fromLatin1 and toAscii to 
toLatin1. Since that's what those functions were before (still are right now 
in qtbase master), I also expect no regressions. Then I'll deprecate the Ascii 
functions.

Finally, probably starting two weeks from now when I'm back from the US, I'll 
start benchmarking and optimising the fromUtf8 function, as well as merging 
the many UTF-8 encoders and decoders in Qt (yes, we have more than one).
	
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden