[Development] format-like tr()

Fri Oct 25 10:20:59 CEST 2024

>> As for me, the standard really misses the customization points for standard
>> formatters. But I still think that we should try to use what it provides,
>> instead of inventing a completely custom implementation.
>
> Can you elaborate on what it misses?

For example, for QBA(V) work I'd really like to use the std parser, but extend
it with some custom types.
The standard string formatter supports only 's' (and '?' in C++23).
It would be really convenient to be able to specify more types without the
need to reimplement the parsing of all previous format parameters (width,
padding, alignment, fill char).
Consequently, we also need a way to access the data that the parser extracted
from the format string. Since the standard strictly defines the format
specification for some types, it can also define that std::formatter
specializations for those types should have certain getters...

------------------------------

Ivan

________________________________________
From: Development <development-bounces at qt-project.org> on behalf of Thiago Macieira <thiago.macieira at intel.com>
Sent: Thursday, October 24, 2024 7:12 PM
To: development at qt-project.org; Nicolas Fella
Subject: Re: [Development] format-like tr()

On Thursday 24 October 2024 03:22:55 Pacific Daylight Time Ivan Solovev via
Development wrote:
> > Unfortunately, I think we'll need the entire parsing and constructing done
> > from scratch.
>
> IMO, we could benefit from the new syntax, if we could build our
> implementation on top of what the standard provides for us. But I see
> little benefit in reimplementing the standard from scratch.

I'm hoping we don't have to do it all. But I don't know yet how far we can go.
Especially since we mean to do a mixed encoding solution:

> I do understand that the standard (even in C++26) does not support
> formatting into char16_t, but as you pointed out, the l10n and logging are
> in UTF-8 anyway.

Not exactly. I meant that the format template is provided in UTF-8, but the
output of the formatting is UTF-16, like QString::asprintf(). We will convert
on-the-fly.

Maybe we buffer in UTF-8 and convert to UTF-16 in chunks. This makes sense, as
the algorithms in qstringconverter.cpp are better with bigger chunks. This
would allow us to use all of the std::formatters to char, instead of requiring
formatters to char16_t (which we could still support). For example, in

  // Don't write JSON like this!
  QByteArray key = getKey();
  QString json = qFormat("{{\"{}\": {}}}", key, number);

we'd have a concatenation of
 "{\""
 key
 "\": "
 number
 "}"

and with your typical JSON key name of 4-8 characters and a number of 1-5,
we'd trigger the optimal conversion paths in qstringconverter.cpp maybe once,
maybe twice out of 5 calls. But a combined string of 11 to 17 characters would
always trigger one of them.

(With last night's 15e801488065e93dde5eb409c8f93795371db300, the conversion
from UTF-8 now uses SIMD starting at 4 characters and supports any size up
from there. Previously, it required multiples of 8.)

With translated strings, it gets slightly more difficult because QTranslator and
our .qm files operate on UTF-16. So we'd need to convert back to UTF-8 to use
the std format parser. We can also just change the .qm file format to include
UTF-8 and ditch the big endian UTF-16, figuring out a way to get through
QTranslator's virtual functions (it's a QObject, so hopefully everyone
provides proper meta objects for them). The .mo file format is already UTF-8.

> As for me, the standard really misses the customization points for standard
> formatters. But I still think that we should try to use what it provides,
> instead of inventing a completely custom implementation.

Can you elaborate on what it misses?

> >> And I still used std::format to do the actual formatting, because I
> >> didn't want to mess with width and padding on my own.
> >
> > For numbers, I agree. For strings, I question the usefulness.
>
> Even for strings it's not straightforward, because not every code point has
> a width of 1. It's all documented in the standard, but still an extra
> complexity while implementing.

Hence my questioning. We don't care much about formatting to terminals. Even
though we have a set of really good non-GUI libraries, our main focus is GUI.
That's for example why our QCommandLineParser isn't full of features.

Formatting strings for display in GUI has a completely different paradigm than
in terminals. The width with variable-width fonts may have little to do with
what the implementations do to support terminals. And even then, does it even
make sense to trim or pad strings in GUI? If you want to align text, you use
multiple QLabels in a QGridLayout. If you want to trim text, you want
QFontMetrics::elidedText().

That's not to say we don't care at all about fixed-width support. There are
terminal-like things in GUI, like QPlainTextEdit and similar.

Finally, there's a question of differences of implementation. For formatting
our strings, I think we should not depend on what the Standard Library
implements for width calculations. We should use QTextBoundaryFinder, which is
in QtCore and uses the Unicode tables we already have in our library anyway.

This also means this content should probably be in qstringformat.cpp, not
added to qstring.cpp.

> Well, the argument id should be a non-negative integer value, yes. But we
> can have our custom format string. Maybe we can have some custom character
> in the beginning, and then delegate the parsing to default one. Something
> like:
>
>  std::format("{0:n*^10.2s}", "test");
>
> Then the parse() method of our custom formatter will parse the first
> character, check if it's 'n', and delegate the rest of the parsing to
> std::formatter<std::string_view>::parse().
> Didn't check if it works in practice, though.

See the other replies, starting with Volker's.

> > But you didn't answer the important part of the question: the padding and
> > filling. The original string in Engineering English was {:%T}, which means
> > no padding and no filling. Are the translators allowed to change it?
>
> I do not have any experience in l10n, so I cannot answer if that's the right
> thing to do or not. But from the developer's point of view it only requires
> that we do the format parsing at runtime instead of compile time, doesn't
> it? But we'll anyway have to do that if we allow to change the format
> specifiers while doing the translations.

That's why I had cc'ed Volker K and Albert, to get their opinions. Adding
Nicolas now.

See also my question about using locale digits. Because yes, there may be
formatting mods that the translator supplies.

As it stands, I think:
* translatable text should not use padding, trimming & filling
* ideally we'd compile-time reject those, but we can just ignore them
* tooling should reject extracting strings with those
* tooling should reject translators inserting them
* formatting content should ignore them from the translated string

That does not apply to selection of variants, like the "L" for numbers.
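
As an illustration of what the tooling check might look like, here is a rough sketch that scans a format template for replacement fields using fill, alignment or width, following the standard format-spec grammar. The names usesPadding() and specHasPadding() are hypothetical. It is deliberately simplified: it does not look at precision (which trims strings but is legitimate for floating-point numbers, so it needs type information), and it leaves malformed templates to the real parser.

```cpp
#include <cstddef>
#include <string_view>

bool isAlign(char c) { return c == '<' || c == '>' || c == '^'; }

// Checks one format-spec (the part after ':') for fill, alignment or width.
bool specHasPadding(std::string_view spec)
{
    if (spec.size() >= 2 && isAlign(spec[1]))
        return true; // fill character followed by alignment
    if (!spec.empty() && isAlign(spec[0]))
        return true; // alignment only
    std::size_t i = 0;
    if (i < spec.size() && (spec[i] == '+' || spec[i] == '-' || spec[i] == ' '))
        ++i; // sign
    if (i < spec.size() && spec[i] == '#')
        ++i; // alternate form
    // '0' flag, literal width, or nested "{}" width argument
    return i < spec.size() && ((spec[i] >= '0' && spec[i] <= '9') || spec[i] == '{');
}

// Returns true if any replacement field in fmt uses fill, alignment or width.
bool usesPadding(std::string_view fmt)
{
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] != '{')
            continue;
        if (i + 1 < fmt.size() && fmt[i + 1] == '{') {
            ++i; // escaped "{{"
            continue;
        }
        const std::size_t close = fmt.find('}', i);
        if (close == std::string_view::npos)
            return false; // malformed; leave it to the real parser
        const std::string_view field = fmt.substr(i + 1, close - i - 1);
        const std::size_t colon = field.find(':');
        if (colon != std::string_view::npos && specHasPadding(field.substr(colon + 1)))
            return true;
        i = close;
    }
    return false;
}
```

Under these rules "{:%T}" and "{:L}" pass, while "{:*^10}" and "{:>8}" would be flagged.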

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Principal Engineer - Intel DCAI Platform & System Engineering