[Development] HEADS-UP: QStringLiteral
Thiago Macieira
thiago.macieira at intel.com
Wed Aug 28 07:21:06 CEST 2019
On Tuesday, 27 August 2019 16:57:55 PDT Kevin Kofler wrote:
> If you do not explicitly add ".UTF-8", glibc always gives you the obsolete
> legacy locale with the locale-specific pre-Unicode character set. This is
> intentional for backwards compatibility. So you should never use a locale
> without a ".UTF-8" suffix, unless, like Thiago, you want to deliberately
> test what happens in a legacy non-UTF-8 locale.
>
> The locales are interpreted by glibc. Anything that assumes that a given
> locale uses a character set different from what glibc actually uses for that
> locale is broken. (But it looks like GCC doesn't assume anything about the
> locale and just always uses UTF-8 to begin with, contrary to what the
> documentation claims.)
Indeed. The charset can be obtained with the nl_langinfo(3) function from the
C library. Since there's no tool to print it for us, we use Python:
$ cat langinfo.py
import locale
print(locale.nl_langinfo(locale.CODESET))
$ python3 langinfo.py
UTF-8
$ LC_ALL=C python3 langinfo.py
ANSI_X3.4-1968
$ LC_ALL=pt_BR python3 langinfo.py
ISO-8859-1
$ LC_ALL=fr_FR@euro python3 langinfo.py
ISO-8859-15
$ LC_ALL=el_GR python3 langinfo.py
ISO-8859-7
$ LC_ALL=zh_CN python3 langinfo.py
GB2312
$ LC_ALL=ja_JP python3 langinfo.py
EUC-JP
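For reference, the same lookup straight from the C library is just setlocale()
plus nl_langinfo(CODESET). A minimal sketch (the codeset.c name and the pt_BR
runs below are only illustrative, and assume both the bare and the .UTF-8
locales are generated on the system):
$ cat codeset.c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG) */
    setlocale(LC_ALL, "");
    /* CODESET names the character set the current locale uses */
    puts(nl_langinfo(CODESET));
    return 0;
}
$ cc codeset.c -o codeset
$ LC_ALL=pt_BR ./codeset
ISO-8859-1
$ LC_ALL=pt_BR.UTF-8 ./codeset
UTF-8
That last pair is Kevin's point in a nutshell: without the ".UTF-8" suffix,
glibc hands you the legacy pre-Unicode charset for that locale.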
I'm *so* glad I didn't remember three of the above and haven't had to think
about them for 15 years. (I thought Japanese on Unix used Shift-JIS and Russian
used KOI8-R.)
Anyway, doing a memory wipe. Aside from ISO-8859-1, I don't want to think of
any of the others for another 15 years.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products