[Development] Oslo, we have a problem</apollo 13> [char8_t]

Mon Jul 8 17:31:23 CEST 2019

On Monday, 8 July 2019 10:53:42 -03 Konstantin Ritt wrote:
> > See my reply to Marc: users want US-ASCII case-insensitive text matching
> > and
> > case folding routines, for network protocols that are US-ASCII case-
> > insensitive (DNS, IRC, etc.).
> 
> That strnicmp() and std::toupper()/std::tolower() is exactly what for.

No, those are exactly what they are NOT for.

First, those are locale-dependent and should not be used unless you control 
the locale or you specifically want to treat your 8-bit content under the 
system's locale codec. On most modern Unix systems, that's UTF-8. But it's not 
uncommon to find applications run with LC_ALL=C, which forces those functions 
to US-ASCII.

And then there's tr_TR.UTF-8, which causes strnicmp("I", "i") != 0. If this is 
what you want, great. Just be careful when using it and expecting US-ASCII 
behaviour, like when parsing the IRC protocol. ksirc used to have a bug where, 
if you joined channel #irc, it would also join #ırc and then open further 
tabs for #Irc and #İrc depending on the messages you received.

Finally, std::toupper and std::tolower are FLAWED BY DESIGN. Do not use them, 
ever. Uppercasing and lowercasing are string functions; any API that returns a 
single character is flawed, because a case mapping can change the length of 
the text (the uppercase of "ß" is "SS"). SG16 means to fix that in the new 
std::text functionality.
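
For the protocol case above, where only US-ASCII letters should match 
case-insensitively, a locale-independent helper is easy to write by hand. What 
follows is only a sketch, not an existing Qt or standard library API: the names 
asciiToLower and asciiCaseEqual are made up for illustration, and it assumes 
C++17 and an ASCII-compatible execution character set.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: lowercase a single US-ASCII letter. Every other byte,
// including the lead and continuation bytes of UTF-8 sequences, is returned
// unchanged, so the result never depends on the current locale.
constexpr char asciiToLower(char c) noexcept
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Hypothetical helper: compare two byte strings for equality, treating only
// the US-ASCII letters A-Z and a-z as case-insensitive.
constexpr bool asciiCaseEqual(std::string_view a, std::string_view b) noexcept
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (asciiToLower(a[i]) != asciiToLower(b[i]))
            return false;
    }
    return true;
}
```

With this, asciiCaseEqual("#IRC", "#irc") is true whatever LC_ALL is set to, 
while "#irc" and the UTF-8 encoding of "#ırc" stay distinct because the bytes 
of the dotless i are never touched.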

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products