[Development] std::format support for Qt string types

Thu Jun 6 17:07:31 CEST 2024

Hello,

On 05/06/2024 15:18, Ivan Solovev via Development wrote:
> Hi,
> 
> I'm now working on introducing std::format support for some of the Qt types.
> I decided to start with the variety of Qt string types, and I have some open
> question regarding the implementation that I want to discuss.
> 
> First, I'd like to give a very short summary of my understanding of how
> std::format works in plain C++ when it comes to string formatting.
> Basically, we have two types of formatters:
> * std::formatter<T, char> that handles std::string, const char *, and
>   const char (&)[N] overloads.
> * std::formatter<T, wchar_t> that handles std::wstring, const wchar_t *,
>   and const wchar_t (&)[N] overloads.
> 
> The encoding for the wide char strings is usually known - it's either UTF-16
> on Windows or UTF-32 on Linux and macOS.
> But what is the encoding for the char strings? The answer is that std::format
> does not care. It just tries to format the characters according to the format
> string. What you see in the terminal fully depends on your terminal encoding.

I think we should conceptually separate formatting from printing on a 
terminal. std::format isn't _just_ for printing on terminals (we now 
have std::print for that). Having said that, I admit that I've fallen 
quite behind

> So, back to the main question. How should we format Qt string types?

Just for the sake of discussion, we can also leave the problem unsolved 
until std::format works with Unicode strings. As much as that's a pain 
point for users, we won't paint ourselves into a corner (see below).

> The support for wide char formatters is straightforward - we can use
> QString::toStdWString() and be sure that we do not get any unreadable
> characters in the formatted output.
> I already have a WIP patch implementing it [0].

In general, I'm not too fond of the idea that we need to re-encode 
strings (= allocations) in order to format them, but I don't see an easy 
way out given the tools at our disposal...

> But what to do with the char formatters? Should we aim for the formatted
> strings to be always readable, or should we just not care, like the
> std::formatter<char> does?

What do you mean by "readable" here?

> I see several options here:
> 
> 1. Treat everything as UTF-8
> 
> Traditionally all QString(View) constructors taking char arrays or std::string
> treat the data as UTF-8. Also, QString::toStdString() provides a UTF-8 encoded
> std::string. So this would be sort of an expected behavior for Qt users.
> 
> With this approach QLatin1StringView should also be converted to UTF-8 before
> being processed by the formatter.

That definitely sounds appealing, in the sense that in any text-based 
API, we expect `char` to be UTF-8. So, formatting into chars means 
formatting into UTF-8.

> 2. Treat everything as Local8Bit
> 
> Basically similar to the previous approach, but use toLocal8Bit() instead of
> toUtf8() when passing the data to the formatter. On Linux and macOS that would
> actually be equivalent to the first approach, because toLocal8Bit() simply
> assumes UTF-8 as an encoding. On Windows it would use CP_ACP to do the
> conversion.
> 
> In this case the behavior would be similar to what qDebug() does.

Again, I'm not really sure about entangling consoles with this.
If you go for this approach and std::print a QString on Windows, what 
kind of output do you get?

> The drawback is that the formatted string might be different from the original
> one. For example, `Ü` might be replaced with `U`, some other symbols might be
> replaced with `?`, depending on the currently selected code page.
> 
> Similarly to the previous option, QLatin1StringView and QUtf8StringView should
> also be converted to Local8Bit before formatting.
> 
> 3. Try to not guess the encoding for the user
> 
> Basically, for QUtf8StringView and QLatin1StringView their encoding is
> explicitly mentioned in the names of the classes, so we can just consider that
> if the users use these classes with std::format, they expect to have UTF-8
> or Latin1 output respectively.

I'm not following this. If I do

  std::format("{} {}", utf8string, latin1string)

what am I supposed to get out? A string which is a mix of two different 
encodings? I don't think that's ever wanted.

> 
> The question here is how to deal with QString(View):
>   3a. Convert it to UTF-8, because that's the pre-existing behavior which
>       should be known to the users.
>   3b. Do not implement std::formatter<QString(View), char> at all and let
>       the users explicitly convert QString to something else first.
> 
> Option 3b is inconvenient and defeats the purpose of std::format support
> for Qt types, so I'd personally prefer 3a here.

The concern I was quoting before is this: suppose that tomorrow we have 
a formatter for `const char16_t *` into char. This formatter does some 
kind of transcoding. Then QString(View) ought to do precisely the same! 
If we take a different decision now, we risk having compatibility 
problems down the line.

Now, I don't really know if formatting char16_t is anywhere on SG16's 
radar in the short term, but that definitely sounds like something to 
investigate and report on, in order to make a more informed decision.

(Not to mention formatting _into_ char16_t, which would unlock something 
like QString::format to *create* a QString!)

Thanks,

-- 
Giuseppe D'Angelo | giuseppe.dangelo at kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - Trusted Software Excellence
