[Development] Why can't QString use UTF-8 internally?

Wed Feb 11 02:22:45 CET 2015

On Wednesday 11 February 2015 01:38:12 Olivier Goffart wrote:
> > Eh... have you tried to convert a UTF-8 or UTF-16 or UCS-4 string to the
> > locale's narrow character set without using QString?
> 
> with std::ctype::tonarrow?

That's std::ctype::narrow, which I didn't realise existed until now. But I 
have to question how anyone can think these functions are useful. A quick 
check reveals a lot of design issues:

> charT do_toupper(charT c) const;
> const charT* do_toupper(charT* low, const charT* high) const;
> 
> Effects: Converts a character or characters to upper case. The second form
> replaces each character *p in the range [low,high) for which a
> corresponding upper-case character exists, with that character.
> 
> Returns: The first form returns the corresponding upper-case character if it
> is known to exist, or its argument if not. The second form returns high.

Neither form can handle a string that grows when uppercased (the famous
"ß" to "SS" case): the first returns a single charT and the second rewrites
the range in place. The function is flawed by design.

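A minimal sketch of the problem, assuming the classic "C" locale and a
made-up helper name (upper_in_place); it only shows that the range form of
toupper cannot change the length of the string:

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercases a copy of s with the range overload of
// std::ctype<wchar_t>::toupper, which rewrites [low, high) in place.
std::wstring upper_in_place(std::wstring s,
                            const std::locale& loc = std::locale::classic())
{
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t>>(loc);
    // One character in, one character out: the range cannot grow.
    ct.toupper(&s[0], &s[0] + s.size());
    return s;
}
```

Whatever a given locale does with "ß", upper_in_place(L"stra\u00dfe") still
has six characters, so the seven-character "STRASSE" can never come out.
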
> charT do_widen(char c) const;
> const char* do_widen(const char* low, const char* high, charT* dest) const;
>
> Effects: Applies the simplest reasonable transformation from a char value or
> sequence of char values to the corresponding charT value or values. The
> only characters for which unique transformations are required are those in
> the basic source character set (2.3).
> For any named ctype category with a ctype<charT> facet ctc and valid
> ctype_base::mask value M, (ctc.is(M, c) || !is(M, do_widen(c)) ) is
> true.
> The second form transforms each character *p in the range [low,high),
> placing the result in dest[p-low]. 
> Returns: The first form returns the transformed value. The second form
> returns high.

Same design flaw: widening from UTF-8 to UTF-16 can shrink the number of
code units (several bytes become one charT), yet the range form writes
exactly one output per input at dest[p-low] and has no way to report how
many characters it actually produced!

Oh, wait. The description is "applies the simplest reasonable transformation". 
In other words, *they don't work*.
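A minimal sketch of that one-to-one behaviour, again assuming the classic
"C" locale; widen_bytes is a made-up helper name, not anything from the
standard or from Qt:

```cpp
#include <locale>
#include <string>

// Hypothetical helper: pushes every byte of the input through
// std::ctype<wchar_t>::widen. The output buffer must hold exactly one
// wchar_t per input char, because that is all the interface allows.
std::wstring widen_bytes(const std::string& bytes)
{
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t>>(std::locale::classic());
    std::wstring out(bytes.size(), L'\0');
    ct.widen(bytes.data(), bytes.data() + bytes.size(), &out[0]);
    return out;
}
```

Feeding it the two UTF-8 bytes of "é" ("\xC3\xA9") yields two wide
characters, not the single one a real decoder would produce.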

So let's try std::codecvt instead. Not only are these classes an absolute 
horror to use, the standard says [locale.codecvt]/3:

> The specializations required in Table 81 (22.3.1.1.1) convert the
> implementation-defined native character set. codecvt<char, char, mbstate_t>
> implements a degenerate conversion; it does not convert at all. The
> specialization codecvt<char16_t, char, mbstate_t> converts between the
> UTF-16 and UTF-8 encoding schemes, and the specialization codecvt
> <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding
> schemes. codecvt<wchar_t,char,mbstate_t> converts between the native
> character sets for narrow and wide characters.

Where's the conversion between the "implementation-defined narrow character
set" and "UTF-16"? It isn't there.

No, this functionality simply isn't possible with the Standard Library, as of 
C++14.
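For comparison, a minimal sketch of what the required specializations do
give you, using codecvt<char16_t, char, mbstate_t> from the paragraph quoted
above. The helper name utf8_to_utf16 is made up for this example, and the
input has to be UTF-8 already; nothing here involves the locale's narrow
character set:

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 with the facet the
// standard requires. Returns an empty string on any conversion problem.
std::u16string utf8_to_utf16(const std::string& in)
{
    typedef std::codecvt<char16_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::u16string out(in.size(), u'\0');
    const char* from_next = nullptr;
    char16_t* to_next = nullptr;

    std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        return std::u16string();

    // Unlike ctype::widen, codecvt reports how much output it produced.
    out.resize(to_next - &out[0]);
    return out;
}
```

That is seven arguments and a manual resize for the one pair of encodings
the standard covers, which is roughly what I mean by a horror to use.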

Qt has done this since Qt 2.0 (June 1999), so we're 15 years ahead and
counting. I would simply not trust a library close to two decades behind us
to do something it hasn't begun to implement yet.

> I understand that we are used to the convenience of the API of QString, but
> it is still a question of taste at this point. And not using the standard
> library type is a problem when it comes to integrate with others.
> If we break source compatibility and ABI consideration to accept
> std::vector, then why not std::string and related?

Because unlike std::vector, std::basic_string is woefully inadequate compared 
to QString and QByteArray. I just mentioned the easy cases, but a quick check 
shows how much more is lacking.

I rest my case. QString will be there at least through whichever major
release of Qt ships before 2020.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center