[Development] Why can't QString use UTF-8 internally?
Thiago Macieira
thiago.macieira at intel.com
Wed Feb 11 21:52:09 CET 2015
On Wednesday 11 February 2015 18:26:40 Guido Seifert wrote:
> > > Yes, and he already said such example, ß becomes SS
> >
> > The other example that was given is 'i' (UTF-8 0x69) becoming 'İ' under a
> > Turkish locale (UTF-8 0xc4 0xb0).
>
> Ah sorry. I was too focused on the visible length. 'i' = 'İ' = 1. But of
> course I have to look at the memory usage in the string. Btw... what would
> happen in Mark's example?
Which example? The one using std::transform with ::toupper?
Well, that depends on what toupper does and whether you configured the global C
library locale correctly.
 -- Function: int toupper (int C)
     Preliminary: | MT-Safe | AS-Safe | AC-Safe | *Note POSIX Safety Concepts::.

     If C is a lower-case letter, `toupper' returns the corresponding
     upper-case letter.  Otherwise C is returned unchanged.
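1) toupper('i') == 'i'
   "istanbul" → "iSTANBUL"
2) toupper('i') == L'İ' == 0x130
   "istanbul" → "0STANBUL"
   (0x130 truncated to a char is 0x30, which is '0')
3) toupper('i') == 'I'
   "istanbul" → "ISTANBUL"
All three results are wrong for Turkish, where the correct uppercase form is
"İSTANBUL".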
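To make the three cases concrete, here is a minimal sketch. The three upper_* functions are hypothetical stand-ins for what toupper might return under different locales; they are not taken from any C library and only handle ASCII a-z. transform_upper is likewise an illustrative helper that mimics std::transform with ::toupper over a std::string.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-ins for the possible toupper() behaviours (ASCII only).
int upper_ascii(int c)  { return (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c; } // case 3
int upper_keep_i(int c) { return c == 'i' ? c : upper_ascii(c); }              // case 1
int upper_wide_i(int c) { return c == 'i' ? 0x130 : upper_ascii(c); }          // case 2

// Apply `up` to every byte and store the result back into a char.
// Going through unsigned char keeps only the low 8 bits, so 0x130 becomes 0x30.
std::string transform_upper(std::string s, int (*up)(int))
{
    std::transform(s.begin(), s.end(), s.begin(), [up](unsigned char c) {
        return static_cast<char>(static_cast<unsigned char>(up(c)));
    });
    return s;
}

// Self-check reproducing the three results listed above.
void check_istanbul_cases()
{
    assert(transform_upper("istanbul", upper_keep_i) == "iSTANBUL");
    assert(transform_upper("istanbul", upper_wide_i) == "0STANBUL");
    assert(transform_upper("istanbul", upper_ascii)  == "ISTANBUL");
}
```

None of the three can produce "İSTANBUL", because 'İ' does not fit in the single byte that std::transform writes back.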
By the way, QByteArray's toUpper and toLower are now documented to operate
*exclusively* on Latin 1 and no locale variants apply, so i becomes I and ß/ÿ
remain ß/ÿ. There used to be a bug in this until 5.4.0 [1].
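As a rough model of that documented behaviour (not the Qt source), a locale-independent Latin 1 uppercasing could look like the sketch below. latin1_upper is a hypothetical helper; the treatment of the accented letters 0xE0..0xFE is my assumption of what "Latin 1 only" implies, while the i, ß and ÿ cases are the ones stated above.

```cpp
#include <cassert>

// Hypothetical helper, not Qt code: uppercase one Latin 1 byte without
// consulting any locale.
unsigned char latin1_upper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return static_cast<unsigned char>(c - 0x20);
    // 0xE0..0xFE are lowercase letters, except 0xF7 (the division sign).
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return static_cast<unsigned char>(c - 0x20);
    // 0xDF (ß) and 0xFF (ÿ) have no single-byte uppercase in Latin 1,
    // so they are returned unchanged, like every non-letter.
    return c;
}

// Self-check for the cases mentioned in the text.
void check_latin1_upper()
{
    assert(latin1_upper('i') == 'I');    // never 'İ', whatever the locale
    assert(latin1_upper(0xDF) == 0xDF);  // ß stays ß
    assert(latin1_upper(0xFF) == 0xFF);  // ÿ stays ÿ
}
```

The point of the sketch is that a one-byte-in, one-byte-out function cannot express ß to "SS" or ÿ to 'Ÿ' (U+0178), so leaving them alone is the only consistent choice.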
Also, QString does not support locale-based case conversions, so "istanbul"
always becomes "ISTANBUL" -- locale-based conversions should be in QLocale,
but the feature is missing. At least "fußball" becomes "FUSSBALL" and ÿ gets
properly uppercased.
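For contrast with the byte-wise case, here is a small sketch of a locale-independent uppercasing over UTF-16 code units where the output may grow. full_upper is a hypothetical helper, not QString's implementation, and it only knows ASCII a-z, ß and ÿ; it is meant purely to illustrate the three results mentioned above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not QString code: locale-independent uppercasing
// that allows the result to be longer than the input.
std::u16string full_upper(const std::u16string &in)
{
    std::u16string out;
    for (char16_t c : in) {
        if (c >= u'a' && c <= u'z')
            out += static_cast<char16_t>(c - u'a' + u'A');
        else if (c == u'\u00df')   // ß expands to two code units
            out += u"SS";
        else if (c == u'\u00ff')   // ÿ maps to Ÿ (U+0178), outside Latin 1
            out += u'\u0178';
        else
            out += c;
    }
    return out;
}

// Self-check for the cases mentioned in the text.
void check_full_upper()
{
    assert(full_upper(u"istanbul") == u"ISTANBUL");
    assert(full_upper(u"fu\u00dfball") == u"FUSSBALL");
    assert(full_upper(u"\u00ff") == u"\u0178");
}
```

Note that "fußball" has seven code units and "FUSSBALL" has eight, so an in-place transform is not enough even with 16-bit storage.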
[1] eef74f82db049517aa5a80e7c9456c4cbda953d1:
[...]
Also as a consequence, this changes the handling of two characters in
Latin 1: 'ß' should be uppercased to "SS" but we won't do it, and 'ÿ'
can't be uppercased in Latin 1 ('Ÿ' is outside the range).
[...]
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center