[Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

Matthew Woehlke mwoehlke.floss at gmail.com
Fri Mar 15 19:09:20 CET 2019


On 15/03/2019 08.27, Giuseppe D'Angelo via Development wrote:
> Il 14/03/19 22:48, Thiago Macieira ha scritto:
>> For
>>
>>    char16_t text1[] = u"" "\u0102";
>>
>> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>>
>> ?text1@@3PA_SA DB '?', 00H, 00H, 00H                    ; text1
>>
>> And with /utf-8:
>>
>> ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H        ; text1
>>
>> Those two values make no sense. U+0102 is neither 0x003f (question 
>> mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An
>> interpretation of the C++11 standard could say that the translation
>> is correct for the no-/utf-8 build, 

In fact, I now believe that to be the case (if unfortunate); note
[lex.phases]¶1.5 and also
https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/EeLI6bqTCwAJ.

>> but with /utf-8 or /execution-charset:utf-8 it should have produced
>> the correct result.
> 
> Actually, those values do have some connection with the input. It
> looks like MSVC is double-encoding it:
> 
> * "\u0102" under UTF-8 execution charset produces a string containing
> 0xC4 0x82;
> 
> * that string literal is a generic narrow string literal (non-prefixed).
> When concatenating to a u-prefixed string literal, somehow MSVC thinks
> it's in its native codepage instead of UTF-8...
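
For concreteness, here's my reading of that double-encoding spelled out
as a tiny standalone snippet (a sketch only; the CP-1252 values are
filled in by hand rather than computed):

    #include <cstdio>

    int main()
    {
        // Phase 5 under /utf-8: "\u0102" becomes the UTF-8 bytes C4 82
        // inside the *narrow* literal.
        const unsigned char narrow[] = { 0xC4, 0x82 };

        // The widening for the u"" concatenation then appears to treat
        // those bytes as CP-1252 rather than UTF-8:
        //   0xC4 -> U+00C4 ('Ä'), 0x82 -> U+201A ('‚')
        const char16_t widened[] = { 0x00C4, 0x201A };

        // Prints "c4 82 -> 00c4 201a", i.e. exactly the code units in
        // the /utf-8 listing above.
        std::printf("%02x %02x -> %04x %04x\n",
                    (unsigned) narrow[0], (unsigned) narrow[1],
                    (unsigned) widened[0], (unsigned) widened[1]);
    }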

*That* smells buggy. I think I'll stick with /we4566 and add the extra
'u' when my QStringLiteral is non-ASCII, so that I'm not hitting this
case.
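
Concretely, that workaround looks like this (a sketch; it relies on
QT_UNICODE_LITERAL_II expanding to u"" str, which is how I read
qstring.h, and on /we4566 turning warning C4566 into a hard error):

    #include <QString>

    // In a CP-1252 build, the first line trips C4566 (an error with
    // /we4566) because the macro concatenates u"" with a *narrow*
    // "\u0102"; the second keeps the escape inside a char16_t literal,
    // so phase 5 never pushes it through the narrow execution charset.
    const QString broken = QStringLiteral("\u0102");
    const QString fixed  = QStringLiteral(u"\u0102");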

> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

Why do you think it's "symbolic"? The standard clearly says "if there is
no corresponding member [of the target character set], [the character]
is converted to an implementation-defined member". That's obviously the
case for the characters in question, so they get mapped to '?'.

AFAICT, in my example (execution character set == CP-1252), MSVC is
doing what the standard requires it to do. It's unfortunate that this
isn't what the user wanted, but I don't see a "solution" except to swap
phases 5 and 6. (But again, this does *not* apply to the ECS == UTF-8 case.)
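
To spell out the ordering I mean (my paraphrase of the C++11 phases, not
a quotation of them):

    u"" "\u0102"        with ECS == CP-1252

    phase 5: each literal is converted to the ECS on its own;
             u"" is untouched, but \u0102 has no CP-1252 member,
             so it becomes the implementation-defined '?'
    phase 6: the adjacent literals are concatenated and the u prefix
             wins, giving u"?" (the '?', 00H in the first listing)

    If phase 6 ran first, the escape would be converted as part of an
    already-char16_t literal and would survive as the code unit 0x0102.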

(Note: Another solution would be to redefine QT_UNICODE_LITERAL_II to
`u ## str`, but that's a SIC (source-incompatible change); see the
sketch below.)
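
Roughly, and hedging because I haven't actually tried this against
qstring.h:

    // Current definition, as I read qstring.h:
    //   #define QT_UNICODE_LITERAL_II(str) u"" str
    //
    // Pasting instead of concatenating keeps the escape inside a
    // char16_t literal, so it never goes through the narrow ECS:
    #define QT_UNICODE_LITERAL_II(str) u ## str

    // ...but it breaks any caller that already passes a prefixed
    // literal, e.g. the explicit-u workaround above:
    //   QStringLiteral(u"\u0102")  ->  uu"\u0102"   // not a valid token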

-- 
Matthew


