[Development] HEADS-UP: QStringLiteral

Tue Aug 27 11:50:33 CEST 2019

Thiago had said:
>>> GCC and Clang default to UTF-8 *unless* you pass -finput-charset to
>>> something different, independent of what your locale is.

On Monday, 26 August 2019 09:20:49 PDT Lars Knoll wrote:
>> That wasn’t how I understood it. Here’s the corresponding man page
>> entry from gcc:
>>
>> -finput-charset=charset
>>         Set the input character set, used for translation from the
>> character set of the input file to the source character set used by
>> GCC.  If the locale does not specify, or GCC cannot get this
>> information from the locale, the default is UTF-8.  This can be
>> overridden by either the locale or this command-line option.
>> Currently the command-line option takes precedence if there's a
>> conflict.  charset can be any encoding supported by the system's
>> "iconv" library routine.
>>
>> I’m happy to be proven wrong, but to me this sounds like it’s getting
>> the file encoding from the locale, if that one specifies a charset.

Thiago Macieira (26 August 2019 19:33) replied:
> I think the documentation is wrong.
[snip]

clang and gcc read their input the same way with LC_ALL unset and with
it set variously to C, POSIX, en_US, pt_BR and el_GR.  I note that none
of these explicitly selects an encoding, so the doc above is indeed
consistent with gcc falling back to UTF-8 because the value of LC_ALL
names no encoding.  Even if the only el_GR or pt_BR locales your host
actually has the necessary data compiled for are ones using an encoding
incompatible with UTF-8, gcc need not have actually checked that if it -
like QSystemLocaleData on Unix - only looks at the values of environment
variables.
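
To make "only looks at the value of environment variables" concrete,
here is a minimal sketch of how a program following the man page's
wording could pick a charset from the locale name alone.  The function
names are hypothetical and nothing here is taken from gcc or Qt; it
assumes the usual language[_territory][.codeset][@modifier] form of
locale names and the POSIX precedence LC_ALL, LC_CTYPE, LANG.

```python
import os


def effective_ctype_locale(env=os.environ):
    """Return the locale name that governs character handling.

    POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; an empty
    value counts as unset.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # assumed fallback when nothing is set


def codeset_of(locale_name):
    """Return the codeset part of a locale name, or None if absent."""
    base = locale_name.split("@", 1)[0]  # drop any @modifier
    _, dot, codeset = base.partition(".")
    return codeset if dot and codeset else None


def guessed_input_charset(env=os.environ):
    """Hypothetical rule: the locale's codeset if named, else UTF-8."""
    return codeset_of(effective_ctype_locale(env)) or "UTF-8"


for name in ("C", "pt_BR", "el_GR", "el_GR.UTF-8", "zh_CN.GBK"):
    print(name, "->", guessed_input_charset({"LC_ALL": name}))
```

Under this rule C, pt_BR and el_GR all yield UTF-8, so tests with those
values cannot distinguish "ignores the locale" from "asks the locale and
finds no codeset"; only a name with an explicit codeset can.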

> $ LC_ALL=pt_BR ls doesntexist
> ls: cannot access 'doesntexist': Arquivo ou diret�rio inexistente
> $ LC_ALL=el_GR ls doesntexist
> ls: cannot access 'doesntexist': ��� ������� ������ ������ � ���������
> $ LC_ALL=el_GR.UTF-8 ls doesntexist
> ls: cannot access 'doesntexist': Δεν υπάρχει τέτοιο αρχείο ή κατάλογος

This is a test of the translations available to ls and the assumptions
*it* makes about encodings, not the assumptions gcc and clang make.

So let's do this experiment with an explicit encoding in the locale
(which I just told the locales package to compile and set up for me):

$ LC_ALL=zh_CN.GBK gcc -S -o - -xc++ - <<<'auto s = u8"€áęǽ";' | grep -F .string
	.string	"\342\202\254\303\241\304\231\307\275"
$ gcc -S -o - -xc++ -finput-charset=GBK - <<<'auto s = u8"€áęǽ";' | grep -F .string
cc1plus: error: failure to convert GBK to UTF-8

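That does, indeed, show that gcc uses UTF-8 even if the locale specifies
some other encoding; it only uses an alternate encoding if
-finput-charset overrides it.  The second command fails because the
here-string is UTF-8 on my terminal and cannot be converted when gcc is
told to read it as GBK, so the option is honoured where the locale's
encoding was not.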
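
For anyone who wants to check the octal escapes in that .string line
rather than take them on trust, here is a small sketch that turns them
back into bytes and compares them with the UTF-8 encoding of the
literal.  The helper name is my own, and it only handles plain ASCII
plus \ooo octal escapes, which is all the output above contains.

```python
import re

OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def asm_string_to_bytes(body):
    """Decode the body of a .string directive into raw bytes.

    Only ASCII text and \\ooo octal escapes are handled; other
    escapes (\\n, \\", ...) are out of scope for this sketch.
    """
    out = bytearray()
    pos = 0
    for match in OCTAL_ESCAPE.finditer(body):
        out += body[pos:match.start()].encode("ascii")
        out.append(int(match.group(1), 8))  # raises if value > 255
        pos = match.end()
    out += body[pos:].encode("ascii")
    return bytes(out)


emitted = asm_string_to_bytes(r"\342\202\254\303\241\304\231\307\275")
print(emitted == "€áęǽ".encode("utf-8"))  # True
print(emitted.decode("utf-8"))            # €áęǽ
```

The bytes are e2 82 ac, c3 a1, c4 99, c7 bd: the UTF-8 forms of U+20AC,
U+00E1, U+0119 and U+01FD, unchanged from the input.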

	Eddy.