[Development] format-like tr()

Thu Oct 24 19:12:35 CEST 2024

On Thursday 24 October 2024 03:22:55 Pacific Daylight Time Ivan Solovev via 
Development wrote:
> > Unfortunately, I think we'll need the entire parsing and constructing done
> > from scratch.
> 
> IMO, we could benefit from the new syntax, if we could build our
> implementation on top of what the standard provides for us. But I see
> little benefit in reimplementing the standard from scratch.

I'm hoping we don't have to do it all. But I don't know yet how far we can go. 
Especially since we mean to do a mixed encoding solution:

> I do understand that the standard (even in C++26) does not support
> formatting into char16_t, but as you pointed out, the l10n and logging are
> in UTF-8 anyway.

Not exactly. I meant that the format template is provided in UTF-8, but the 
output of the formatting is UTF-16, like QString::asprintf(). We will convert 
on-the-fly.

Maybe we buffer in UTF-8 and convert to UTF-16 in chunks. This makes sense, as 
the algorithms in qstringconverter.cpp are better with bigger chunks. This 
would allow us to use all of the std::formatters to char, instead of requiring 
formatters to char16_t (which we could still support). For example, in

  // Don't write JSON like this!
  QByteArray key = getKey();
  QString json = qFormat("{{\"{}\": {}}}", key, number);

we'd have a concatenation of
 "{\""
 key
 "\": "
 number
 "}"

and with your typical JSON key name of 4-8 characters and a number of 1-5, 
we'd trigger the optimal conversion paths in qstringconverter.cpp maybe once, 
maybe twice out of 5 calls. But a combined string of 11 to 17 characters would 
always trigger one of them.

(With last night's commit 15e801488065e93dde5eb409c8f93795371db300, the 
conversion from UTF-8 now uses SIMD starting at 4 characters and supports any 
size up from there. Previously, it required multiples of 8.)
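
To make the buffering idea concrete, here is a minimal sketch in plain 
standard C++ rather than Qt. The names utf8ToUtf16() and formatJsonPair() are 
made up for illustration, the decoder is a naive stand-in for the SIMD code in 
qstringconverter.cpp, and std::to_string() stands in for whatever a 
std::formatter<T, char> would write into the buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode a complete UTF-8 buffer to UTF-16 in one pass.
// Malformed bytes become U+FFFD. Overlong forms and encoded surrogates are
// not rejected; this is a sketch, not a validating decoder.
std::u16string utf8ToUtf16(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;          // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) {           // skip one byte and resynchronize
            out.push_back(char16_t(0xFFFD));
            ++i;
            continue;
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                              // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Append every piece to one UTF-8 staging buffer, then convert once.
std::u16string formatJsonPair(std::string_view key, int number)
{
    std::string buf;
    buf += "{\"";
    buf += key;
    buf += "\": ";
    buf += std::to_string(number);
    buf += "}";
    return utf8ToUtf16(buf);   // single conversion over the combined string
}

// Usage, written as a self-check.
inline void checkJsonPair()
{
    assert(formatJsonPair("name", 42) == u"{\"name\": 42}");
}
```

The point is only the shape: five short appends, one conversion call over the 
combined buffer.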

With translated strings, it gets slightly more difficult because QTranslator and 
our .qm files operate on UTF-16. So we'd need to convert back to UTF-8 to use 
the std format parser. We can also just change the .qm file format to include 
UTF-8 and ditch the big endian UTF-16, figuring out a way to get through 
QTranslator's virtual functions (it's a QObject, so hopefully everyone 
provides proper meta objects for them). The .mo file format is already UTF-8.
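
The round trip for translated templates could look roughly like this. It is a 
sketch in standard C++ under the assumption that the translation arrives as 
big-endian UTF-16 bytes, as described above; utf16beToUtf8() is a made-up 
name, not a QTranslator or Qt API:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turn big-endian UTF-16 bytes into a UTF-8 string that
// a char-based format parser could consume. Unpaired surrogates become U+FFFD.
std::string utf16beToUtf8(const unsigned char *data, std::size_t byteCount)
{
    assert(byteCount % 2 == 0);   // UTF-16 always comes in 2-byte units
    std::string out;
    out.reserve(byteCount);
    auto unit = [data](std::size_t i) -> char32_t {
        return (char32_t(data[i]) << 8) | data[i + 1];
    };
    for (std::size_t i = 0; i + 1 < byteCount; i += 2) {
        char32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < byteCount) {
            const char32_t low = unit(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {   // valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;                            // unpaired surrogate
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Storing UTF-8 in the .qm file would make this step unnecessary.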

> As for me, the standard really misses the customization points for standard
> formatters. But I still think that we should try to use what it provides,
> instead of inventing a completely custom implementation.

Can you elaborate on what it misses?

> >> And I still used std::format to do the actual formatting, because I
> >> didn't want to mess with width and padding on my own.
> > 
> > For numbers, I agree. For strings, I question the usefulness.
> 
> Even for strings it's not straightforward, because not every code point has
> a width of 1. It's all documented in the standard, but still an extra
> complexity while implementing.

Hence my questioning. We don't care much about formatting to terminals. Even 
though we have a set of really good non-GUI libraries, our main focus is GUI. 
That's for example why our QCommandLineParser isn't full of features.

Formatting strings for display in GUI has a completely different paradigm than 
in terminals. The width with variable-width fonts may have little to do with 
what the implementations do to support terminals. And even then, does it even 
make sense to trim or pad strings in GUI? If you want to align text, you use 
multiple QLabels in a QGridLayout. If you want to trim text, you want 
QFontMetrics::elidedText().

That's not to say we don't care at all about fixed-width support. There are 
terminal-like things in GUI, like QPlainTextEdit and similar.

Finally, there's a question of differences of implementation. For formatting 
our strings, I think we should not depend on what the Standard Library 
implements for width calculations. We should use QTextBoundaryFinder, which is 
in QtCore and uses the Unicode tables we already have in our library anyway.

This also means this content should probably be in qstringformat.cpp, not 
added to qstring.cpp.

> Well, the argument id should be a non-negative integer value, yes. But we
> can have our custom format string. Maybe we can have some custom character
> in the beginning, and then delegate the parsing to default one. Something
> like:
> 
>  std::format("{0:n*^10.2s}", "test");
> 
> Then the parse() method of our custom formatter will parse the first
> character, check if it's 'n', and delegate the rest of the parsing to
> std::formatter<std::string_view>::parse().
> Didn't check if it works in practice, though.

See the other replies, starting with Volker's.

> > But you didn't answer the important part of the question: the padding and
> > filling. The original string in Engineering English was {:%T}, which means
> > no padding and no filling. Are the translators allowed to change it?
> 
> I do not have any experience in l10n, so I cannot answer if that's the right
> thing to do or not. But from the developer's point of view it only requires
> that we do the format parsing at runtime instead of compile time, doesn't
> it? But we'll anyway have to do that if we allow to change the format
> specifiers while doing the translations.

That's why I had cc'ed Volker K and Albert, to get their opinions. Adding 
Nicolas now.

See also my question about using locale digits. Because yes, there may be 
formatting mods that the translator supplies.

As it stands, I think:
* translatable text should not use padding, trimming & filling
* ideally we'd compile-time reject those, but we can just ignore them
* tooling should reject extracting strings with those
* tooling should reject translators inserting them
* formatting content should ignore them from the translated string

That does not apply to selection of variants, like the "L" for numbers.
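
The rules above could be enforced by a small check in the extraction or 
translation tooling. The following is only a sketch in standard C++: 
specUsesPadding() and templateUsesPadding() are hypothetical names, the 
scanner understands just enough of the std::format grammar to spot fill, 
alignment and width, and it deliberately does not look at precision, because 
the string alone does not tell whether the argument is a string (trimming) or 
a number:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if a format-spec (the part after ':') asks for fill, alignment or width.
bool specUsesPadding(std::string_view spec)
{
    auto isAlign = [](char c) { return c == '<' || c == '>' || c == '^'; };
    if (spec.size() >= 2 && isAlign(spec[1]))   // "[fill]align"
        return true;
    if (!spec.empty() && isAlign(spec[0]))      // "align" without a fill
        return true;
    std::size_t i = 0;
    while (i < spec.size()
           && (spec[i] == '+' || spec[i] == '-' || spec[i] == ' '
               || spec[i] == '#'))
        ++i;                                    // sign and '#' are harmless
    // A digit here is the '0' flag or a width; '{' is a dynamic width.
    return i < spec.size()
        && ((spec[i] >= '0' && spec[i] <= '9') || spec[i] == '{');
}

// Scan a whole template; "{{" is an escaped brace, not a replacement field.
bool templateUsesPadding(std::string_view fmt)
{
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] != '{')
            continue;
        if (i + 1 < fmt.size() && fmt[i + 1] == '{') {
            ++i;
            continue;
        }
        const std::size_t close = fmt.find('}', i);
        if (close == std::string_view::npos)
            return false;                       // malformed: not our job here
        const std::string_view field = fmt.substr(i + 1, close - i - 1);
        const std::size_t colon = field.find(':');
        if (colon != std::string_view::npos
            && specUsesPadding(field.substr(colon + 1)))
            return true;
        i = close;
    }
    return false;
}

// Usage, written as a self-check.
inline void checkTemplates()
{
    assert(!templateUsesPadding("Elapsed: {:%T}"));    // the original string
    assert(templateUsesPadding("Elapsed: {:>12%T}"));  // translator added width
    assert(!templateUsesPadding("{:L}"));              // variant selection is fine
}
```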

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Principal Engineer - Intel DCAI Platform & System Engineering