[Development] char8_t summary?

Sat Jul 13 13:41:21 CEST 2019

On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
> That said, I took a look at startsWith, and... surprise! It is *already
> a template*. So at least in that case, it isn't obvious why adding more
> combinations would be so terribly onerous.

Again, note how the template implicitly assumes that lengths are comparable
between the two string types. A 3-character string cannot be present at the
beginning (startsWith), end (endsWith) or anywhere in the middle (contains,
indexOf, lastIndexOf) of a 2-character one, for example.

But a 2- or 3-byte UTF-8 string can be the prefix of a 1-character UTF-16
string, and a 4-byte UTF-8 string can be the prefix of a 2-code-unit UTF-16
string (1 character). So implementing UTF-8 functions requires different
algorithms in the first place, which is why templates are not usually the
answer.
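
To make the length mismatch concrete, here is a small sketch in plain
standard C++ (no Qt). The names tooLongToMatch and checkLengths are
hypothetical and only illustrate the kind of early exit a same-encoding
comparison can make; this is not how Qt's templates are actually written.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not Qt API): the early exit that is valid when
// haystack and needle have one code unit per character.
constexpr bool tooLongToMatch(std::size_t haystackUnits, std::size_t needleUnits)
{
    return needleUnits > haystackUnits;
}

// U+20AC EURO SIGN in both encodings.
constexpr std::u16string_view euroUtf16 = u"\u20AC";       // 1 code unit
constexpr std::string_view euroUtf8 = "\xE2\x82\xAC";      // 3 bytes

// U+1F600 needs a surrogate pair in UTF-16 and 4 bytes in UTF-8.
constexpr std::u16string_view faceUtf16 = u"\U0001F600";   // 2 code units
constexpr std::string_view faceUtf8 = "\xF0\x9F\x98\x80";  // 4 bytes

inline void checkLengths()
{
    assert(euroUtf16.size() == 1 && euroUtf8.size() == 3);
    assert(faceUtf16.size() == 2 && faceUtf8.size() == 4);
    // Both strings hold the same single character, yet the length-based
    // early exit would reject the match when the encodings are mixed.
    assert(tooLongToMatch(euroUtf16.size(), euroUtf8.size()));
    assert(tooLongToMatch(faceUtf16.size(), faceUtf8.size()));
}
```

The needle is "longer" than the haystack in code units even though it is
the very same text, so a UTF-8 needle cannot reuse that shortcut.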

I'm not saying it's impossible. You can do it by writing sufficiently generic
algorithms that scan the strings in lockstep (you can scan UTF-8 backwards,
after all). But the reason you don't *want* to is that our Latin1 and UTF-16
algorithms are optimised, often vectorised, for their purpose. We don't want
to lose the efficiency we've already got.
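
As a rough illustration of such a lockstep scan, the sketch below decodes one
code point at a time from each side and compares them. It is plain standard
C++ and all function names (nextFromUtf8, nextFromUtf16, startsWithUtf8) are
hypothetical, not Qt API. It assumes that malformed or truncated input should
simply count as "no match" and it does not reject overlong UTF-8 forms. It is
scalar, one code point per iteration, which is exactly the efficiency cost
compared to a vectorised same-encoding comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Decode one code point from UTF-8 starting at pos. Returns false on
// truncated or malformed input; pos is only meaningful on success.
bool nextFromUtf8(std::string_view s, std::size_t &pos, char32_t &cp)
{
    if (pos >= s.size())
        return false;
    const auto b0 = static_cast<unsigned char>(s[pos]);
    std::size_t extra;
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else
        return false;             // stray continuation byte or invalid lead
    if (s.size() - pos - 1 < extra)
        return false;             // sequence cut short
    ++pos;
    for (std::size_t i = 0; i < extra; ++i, ++pos) {
        const auto b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80)
            return false;         // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return true;
}

// Decode one code point from UTF-16 starting at pos, with the same
// failure convention as above.
bool nextFromUtf16(std::u16string_view s, std::size_t &pos, char32_t &cp)
{
    if (pos >= s.size())
        return false;
    const char16_t u0 = s[pos++];
    if (u0 < 0xD800 || u0 > 0xDFFF) {
        cp = u0;                  // BMP character, one code unit
        return true;
    }
    if (u0 > 0xDBFF || pos >= s.size())
        return false;             // lone low surrogate or truncated pair
    const char16_t u1 = s[pos++];
    if (u1 < 0xDC00 || u1 > 0xDFFF)
        return false;             // high surrogate without a low one
    cp = 0x10000 + ((char32_t(u0) - 0xD800) << 10) + (char32_t(u1) - 0xDC00);
    return true;
}

// Lockstep prefix test: UTF-16 haystack, UTF-8 needle.
bool startsWithUtf8(std::u16string_view haystack, std::string_view needle)
{
    std::size_t h = 0, n = 0;
    while (n < needle.size()) {
        char32_t a, b;
        if (!nextFromUtf16(haystack, h, a) || !nextFromUtf8(needle, n, b))
            return false;         // haystack exhausted or bad input
        if (a != b)
            return false;
    }
    return true;                  // whole needle consumed
}
```

With this, startsWithUtf8(u"\u20ACuro", "\xE2\x82\xAC") is true even though
the needle has more code units than the character it matches.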

And I'm not saying we shouldn't have UTF-8 algorithms or even a
QUtf8StringView or some such. It would have helped in CBOR, for example; see
QCborStreamWriter:
    void appendTextString(const char *utf8, qsizetype len);

This is one that should at least get the overload.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products