[Development] std::format support for Qt string types

Edward Welbourne edward.welbourne at qt.io
Fri Jun 7 11:54:50 CEST 2024


Peppe had said:
>> I'm not following this. If I do
>>
>>  std::format("{} {}", utf8string, latin1string)
>>
>> what am I supposed to get out? A string which is a mix of two different
>> encodings? I don't think that's ever possibly wanted.

Ivan Solovev (7 June 2024 10:53) replied:
> Yes, that's exactly what I mean. And, by the way, that's exactly how
> std::format is working now.
> If you write something like this:
>
>  std::string utf8{"\xC3\x84\xC3\x96\xC3\x9C"}; // ÄÖÜ in UTF-8
>  std::string latin1{"\xC4\xD6\xDC"}; // ÄÖÜ in Latin1
>  std::string buffer;
>  std::format_to(std::back_inserter(buffer), "{} {}", utf8, latin1);
>
> Then the resulting buffer will simply contain
> "\xC3\x84\xC3\x96\xC3\x9C \xC4\xD6\xDC".
>
> So, std::format does not care about the encodings.

To be fair, std::format is given no information above as to the encoding
of the two strings - it doesn't know the names you've given the
variables, only their values - so this isn't really comparable to the
case where Qt code uses QLatin1StringView and QUtf8StringView; their
types tell Qt what encoding they're using.

My guess is that std::string is really the equivalent of QByteArray, not
a String in Qt's use of the term (i.e. it doesn't know its encoding).

If the standard does not address the question of encodings, I suggest we
(via Ville) poke them about that.  For my part, I'd say The Right Thing
To Do is to use UTF-8 consistently in all use of std::string, so that
client code that needs the result in a different encoding can do its own
conversion, knowing what the baseline is for what it's getting from Qt.

At least some uses of std::format shall be to produce data to send to a
database or over a socket, where the current user's settings are
irrelevant, so using the user's local encoding strikes me as A Bad Plan.
It probably makes sense for std::print but, even then, we'll be better
off if we know the content being passed to std::print is in UTF-8,
because then print knows it always has to convert from UTF-8 to the
local encoding, rather than having to muck about determining which
encoding it's being fed.
Other consumers of std::format's results (such as a database or socket)
may need some other encoding: whatever's interacting with those has, in
any case, to work out the right encoding to use for it and convert to it
if needed; its life shall be easier if it knows the std::format result
is always UTF-8, rather than having to query the user's settings to
determine whether it needs to convert (and whether that's even
possible) and how to do the conversion.

And, of course, using the user's native encoding may be broken because
it simply lacks a representation for some of the content you're
formatting, meaning that no down-stream processing can recover from the
fact that std::format didn't succeed in faithfully representing the data
it was given.

As usual, the only even remotely sane answer for 8-bit is UTF-8,

	Eddy.
