[Development] Oslo, we have a problem</apollo 13> [char8_t]

Sun Jul 7 11:58:15 CEST 2019

Hi Thiago,

 > But QByteArray is encoding-indeterminate since it can carry any type.

Correct, it is often used as a "generic raw data array", e.g. in QFile,
Q*Socket, QSerialPort, QCanBusFrame etc. Here we really need to treat
the data as-is, without interpretation.

 > Arguably, toUpper() and toLower() should be removed, since
 >
 > 	QByteArray(u8"Résumé").toLower()
 > is mojibake.

I vote against that. If you got the "raw" data from a device as
described above, you might want to do .toHex().toUpper() which is fully
valid.
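
A minimal sketch of that use case, assuming plain std::string in place of
QByteArray. toHex() and asciiToUpper() below are hypothetical stand-ins, not
the Qt functions. The point is only that hex output is pure ASCII, so an
ASCII-only upper-casing step is enough:

```cpp
#include <string>

// Hypothetical stand-in for toHex(): two lower-case hex digits per input byte.
inline std::string toHex(const std::string &raw)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(raw.size() * 2);
    for (char c : raw) {
        const unsigned char u = static_cast<unsigned char>(c);
        out.push_back(digits[u >> 4]);    // high nibble
        out.push_back(digits[u & 0x0F]);  // low nibble
    }
    return out;
}

// Hypothetical helper: upper-cases only the ASCII letters a-z and leaves
// every other byte untouched.
inline std::string asciiToUpper(std::string bytes)
{
    for (char &c : bytes) {
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - 'a' + 'A');
    }
    return bytes;
}
```

With these helpers, the four raw bytes DE AD 00 7F become the text "DEAD007F".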

So either:

- restrict the functions that currently operate on Latin1 to pure ASCII
- or give a possibility to select the encoding in QByteArray
   (e.g. by adding fromLatin1() and fromUtf8()). That would also
   add the possibility to correctly operate on UTF-8 strings.
   But probably a separate QString8 class would be better.
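
To illustrate the first option, here is a rough sketch using std::string
instead of QByteArray. Both helpers are hypothetical and written only for this
mail. latin1ToLower() assumes the usual Latin-1 rule that 0xC0-0xDE (except
0xD7) lower-case by adding 0x20. In UTF-8, "é" is the two bytes C3 A9; the
Latin-1 rule turns C3 into E3 and so breaks the sequence, while the ASCII-only
variant leaves it alone:

```cpp
#include <string>

// Hypothetical helper: lower-cases only the ASCII letters A-Z.
inline std::string asciiToLower(std::string bytes)
{
    for (char &c : bytes) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
    return bytes;
}

// Hypothetical helper: Latin-1 lower-casing. Besides A-Z it also maps the
// bytes 0xC0-0xDE (except 0xD7, the multiplication sign) by adding 0x20.
inline std::string latin1ToLower(std::string bytes)
{
    for (char &c : bytes) {
        const unsigned char u = static_cast<unsigned char>(c);
        const bool asciiUpper = (u >= 'A' && u <= 'Z');
        const bool latin1Upper = (u >= 0xC0 && u <= 0xDE && u != 0xD7);
        if (asciiUpper || latin1Upper)
            c = static_cast<char>(u + 0x20);
    }
    return bytes;
}
```

For the UTF-8 bytes of "Résumé", asciiToLower() only changes the leading "R",
whereas latin1ToLower() also rewrites both C3 lead bytes to E3.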

 > I wouldn't mind a udata() function anyway, since there's a lot of code
 > dealing with "bytes" as unsigned char.

+1

 > Are we willing to add ubegin() and begin8() too?

If that all fits in one class, why not?!
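
Roughly what I have in mind for the unsigned accessors, as a sketch only.
ByteBuffer is a made-up stand-in for the byte-array class, and udata(),
ubegin() and uend() are the proposed names, not existing API. I left out
begin8(), since I am not sure a char8_t view of char storage can be done
without running into aliasing questions:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a byte-array class; this is not QByteArray.
class ByteBuffer
{
public:
    explicit ByteBuffer(std::string bytes) : m_bytes(std::move(bytes)) {}

    const char *data() const { return m_bytes.data(); }
    std::size_t size() const { return m_bytes.size(); }

    // Same storage viewed as unsigned char. unsigned char is allowed to
    // alias any object, so this cast is well-defined.
    const unsigned char *udata() const
    {
        return reinterpret_cast<const unsigned char *>(m_bytes.data());
    }
    const unsigned char *ubegin() const { return udata(); }
    const unsigned char *uend() const { return udata() + size(); }

private:
    std::string m_bytes;
};

// Sums all bytes as values in 0..255, which needs a cast per element when
// only a (possibly signed) char pointer is available.
inline unsigned byteSum(const ByteBuffer &buf)
{
    unsigned sum = 0;
    for (const unsigned char *p = buf.ubegin(); p != buf.uend(); ++p)
        sum += *p;
    return sum;
}
```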

Best regards,
André

On 06.07.19 18:59, Thiago Macieira wrote:
> On Saturday, 6 July 2019 11:09:36 -03 Mutz, Marc via Development wrote:
>>> Anyway, QByteArray has *Latin1* text-manipulation functions (toUpper and
>>> toLower), and its split(char) function will happily split on individual
>>> bytes of a UTF-8 multibyte sequence, so adding char8_t overloads seems
>>> just wrong to me.
>>
>> const char* in Qt is always assumed to be UTF-8-encoded. You need to use
>> QLatin1String to have it interpreted as Latin-1:
>>
>> https://doc.qt.io/qt-5/qstring.html#QString-8
>> https://doc.qt.io/qt-5/qstring.html#QString-7
>
> That's QString, not QByteArray.
>
> But QByteArray is encoding-indeterminate since it can carry any type.
> Arguably, toUpper() and toLower() should be removed, since
>
> 	QByteArray(u8"Résumé").toLower()
> is mojibake.
>
> In fact, QByteArray should use std::byte in functions like data(), but that's
> unwieldy and breaks too much compatibility.
>
>>> What did you try to use QByteArray with that showed problems?
>>
>> Just QByteArray(u8"Hello") already fails when compiled with -std=c++2a.
>> And this is also why we need to fix it. The same compiles fine in C++17,
>> and does the expected thing.
>
> I think we need to talk to SG16.
>
> We can add the template overloads to all functions so we can take char,
> unsigned char, std::byte and char8_t without complaining. I am with you that
> this could result in explosive compile times[1]. But it also does not solve
> the problem of what type data() / constData() and the iteration functions
> return.
>
> I wouldn't mind a udata() function anyway, since there's a lot of code dealing
> with "bytes" as unsigned char. Are we willing to add ubegin() and begin8()
> too?
>
> [1] Please, no one say "Modules!" here, it's not a full solution, even if we
> can use them in Qt 6's lifetime.
>