[Development] Oslo, we have a problem</apollo 13> [char8_t]

Sun Jul 7 19:21:13 CEST 2019

On 06/07/2019 12:43, Mutz, Marc via Development wrote:
> C++20 is coming along, and it brings a disruptive change, one that far
> surpasses the C++17 noexcept break: u8"Hello" is now const char8_t[], no
> longer const char[].
> 
> To estimate the amount of breakage this will cause, assuming that using
> u8"" is good practice today, to indicate that a string is in UTF-8. I've
> tried to have at least QByteArray not break... and failed.

Whether that is good practice is actually questionable: SG16 reports
that u8 literals have seen very limited adoption (and I, for one, have
not been suggesting their use until the C++2a situation is clarified):

> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

> Code surveys have so far revealed little use of u8 literals

> The initial idea is simple enough: add const char8_t* overloads for
> const char* functions. This breaks passing nullptr, so you also add
> std::nullptr_t overloads. This, however, still doesn't fix the case
> where a 0 is passed. I've expected that the std::nullptr_t overload is a
> preferred match over the const char[8_t]* ones, but GCC 9.1 disagrees,
> and tells me it's still ambiguous.
> 
> So, if GCC is right, we have no way of adapting our API to not break in
> C++20. So we need to decide what to break:
> 
> a) using 0 for nullptr, or
> b) using u8"Hello" at all
> 
> The forward-looking choice would be to break (a) and support (b).

In the general case: break the use of 0 as a null pointer constant.
Such code would fail anyhow as soon as one adds overloads taking other
pointer types, not specifically char8_t*; and adding overloads has to be
acceptable in the general case. Plus: we already have warnings for using
0 as a null pointer constant (e.g. -Wzero-as-null-pointer-constant in
GCC and Clang), and clang-tidy can automate the migration
(modernize-use-nullptr). On the other hand, I'm not sure about MSVC.
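
To illustrate the general case, here is a minimal sketch with a
hypothetical overload set (take() and Picked are made-up names, not Qt
API). It uses const char16_t* as the "other pointer type" so that it
also compiles before C++20; the same reasoning should apply to
const char8_t*:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical overload set; the names are illustrative, not Qt API.
enum class Picked { CharPtr, Char16Ptr, NullPtr };

Picked take(const char *) { return Picked::CharPtr; }
Picked take(const char16_t *) { return Picked::Char16Ptr; } // any second pointer type
Picked take(std::nullptr_t) { return Picked::NullPtr; }

void checkOverloads()
{
    // String literals decay to exactly one of the pointer overloads.
    assert(take("abc") == Picked::CharPtr);
    assert(take(u"abc") == Picked::Char16Ptr);

    // nullptr is an exact match for the std::nullptr_t overload.
    assert(take(nullptr) == Picked::NullPtr);

    // take(0);
    // Left commented out: the literal 0 converts to every overload above
    // with the same conversion rank, so the call is expected to be
    // rejected as ambiguous.
}
```

The std::nullptr_t overload only helps callers that actually write
nullptr; it does nothing for callers that pass a literal 0.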

In the specific case: are we sure it makes sense to add a char8_t
constructor to QByteArray? It currently sits somewhere between a pure
"std::byte vector" (e.g. it's used to transmit raw bytes from I/O
devices, etc.) and a US-ASCII (?) string (e.g. given some of its APIs,
like toUpper()). It is by no means a container of UTF-8 encoded strings,
and we shouldn't give the illusion that it is.

Of course there are plenty of other APIs that will need a resolution
instead... just to name one: QString::fromUtf8.
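
As a rough sketch of what one possible resolution for a fromUtf8-style
function could look like (this is an assumption for illustration, not a
proposal for the actual Qt API): fromUtf8() below is a hypothetical
stand-in that returns std::string, and fromUtf8Literal() is a made-up
helper. The char8_t overload is only compiled when the compiler
advertises char8_t support:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an API shaped like QString::fromUtf8();
// it simply copies the bytes into a std::string.
std::string fromUtf8(const char *data, std::size_t size)
{
    return std::string(data, size);
}

#if defined(__cpp_char8_t)
// Only available when char8_t exists (C++20): forward to the char
// overload. Reading char8_t storage through const char* is permitted.
std::string fromUtf8(const char8_t *data, std::size_t size)
{
    return fromUtf8(reinterpret_cast<const char *>(data), size);
}
#endif

// Made-up helper: accepts u8"" literals whether they are const char[N]
// (C++17 and earlier) or const char8_t[N] (C++20).
template <typename Char, std::size_t N>
std::string fromUtf8Literal(const Char (&literal)[N])
{
    return fromUtf8(literal, N - 1); // drop the terminating NUL
}

void checkFromUtf8()
{
    assert(fromUtf8Literal(u8"abc") == "abc");
    // U+00E9 takes two bytes in UTF-8, so the result holds 6 bytes.
    assert(fromUtf8Literal(u8"h\u00e9llo").size() == 6);
}
```

With this shape the same call site compiles in both language modes;
whether that is desirable for a given API is a separate question.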

My 2 c,
-- 
Giuseppe D'Angelo | giuseppe.dangelo at kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
