[Development] Why can't QString use UTF-8 internally?

Wed Feb 11 21:52:09 CET 2015

On Wednesday 11 February 2015 18:26:40 Guido Seifert wrote:
> > > Yes, and he already gave such an example: ß becomes SS
> > 
> > The other example that was given is 'i' (UTF-8 0x69) becoming 'İ' under a
> > Turkish locale (UTF-8 0xc4 0xb0).
> 
> Ah sorry. I was too focused on the visible length. 'i' = 'İ' = 1. But of
> course I have to look at the memory usage in the string. Btw... what would
> happen in Mark's example?

Which example? Using std::transform with ::toupper?

Well, that depends on what toupper does and whether you configured the global C 
library locale correctly.

 -- Function: int toupper (int C)
     Preliminary: | MT-Safe | AS-Safe | AC-Safe | *Note POSIX Safety
     Concepts::.

     If C is a lower-case letter, `toupper' returns the corresponding
     upper-case letter.  Otherwise C is returned unchanged.

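1) toupper('i') == 'i' (the character is left unchanged)
	"istanbul" → "iSTANBUL"

2) toupper('i') == L'İ' == 0x130 (stored back into a char, 0x130 is
   truncated to 0x30, which is '0')
	"istanbul" → "0STANBUL"

3) toupper('i') == 'I'
	"istanbul" → "ISTANBUL"

All three results are wrong for Turkish, where the expected result is 
"İSTANBUL".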
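
For illustration, here is a minimal sketch of cases 2 and 3 in standard C++. 
It assumes the program never calls setlocale, so std::toupper runs in the 
default "C" locale. The function turkish_toupper is a hypothetical stand-in 
for a C library that returns U+0130 for 'i'; it is not a real libc function, 
and the other names are made up for this example too.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case 3: byte-wise uppercasing in the default "C" locale.
std::string upper_bytewise(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Hypothetical stand-in (assumption): a toupper that maps 'i' to U+0130.
int turkish_toupper(int c)
{
    return c == 'i' ? 0x130 : std::toupper(c);
}

// Case 2: the int result is narrowed back into one byte, so 0x130
// loses its high bits and becomes 0x30, which is the digit '0'.
std::string upper_truncating(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(static_cast<unsigned char>(turkish_toupper(c)));
    });
    return s;
}
```

Neither function can produce the two UTF-8 bytes 0xc4 0xb0 for 'İ', because 
std::transform writes exactly one output element per input element.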

By the way, QByteArray's toUpper and toLower are now documented to operate 
*exclusively* on Latin 1, with no locale variants applied, so i becomes I and 
ß/ÿ remain ß/ÿ. This was buggy until 5.4.0 [1].
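
A rough sketch of what locale-independent Latin 1 uppercasing amounts to, in 
standard C++. This is not Qt's implementation; latin1_toupper and 
latin1_upper are hypothetical names, and the only thing assumed is the 
Latin 1 code chart (lowercase letters at 0x61..0x7a and 0xe0..0xfe, with 
0xf7 being the division sign).

```cpp
#include <string>

// Hypothetical helper, not Qt code: uppercase a single Latin 1 byte
// without consulting any locale.
unsigned char latin1_toupper(unsigned char c)
{
    const bool ascii_lower  = c >= 0x61 && c <= 0x7a;              // a..z
    const bool latin1_lower = c >= 0xe0 && c <= 0xfe && c != 0xf7; // à..þ, skipping ÷
    // 0xdf (ß) and 0xff (ÿ) fall through unchanged: "SS" needs two
    // characters and Ÿ lies outside Latin 1.
    return (ascii_lower || latin1_lower) ? static_cast<unsigned char>(c - 0x20) : c;
}

// Apply the per-byte mapping to a whole Latin 1 encoded string.
std::string latin1_upper(std::string s)
{
    for (char &ch : s)
        ch = static_cast<char>(latin1_toupper(static_cast<unsigned char>(ch)));
    return s;
}
```

With this mapping the output always has the same length as the input, which 
is why ß cannot become "SS" here.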

Also, QString does not support locale-based case conversions, so "istanbul" 
always becomes "ISTANBUL" -- locale-based conversions should be in QLocale, 
but the feature is missing. At least "fußball" becomes "FUSSBALL" and ÿ gets 
properly uppercased.

[1] eef74f82db049517aa5a80e7c9456c4cbda953d1:
[...]
    Also as a consequence, this changes the handling of two characters in
    Latin 1: 'ß' should be uppercased to "SS" but we won't do it, and 'ÿ'
    can't be uppercased in Latin 1 ('Ÿ' is outside the range).
[...]

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center