[Development] QString and related changes for Qt 6
Lars Knoll
lars.knoll at qt.io
Tue May 12 09:49:06 CEST 2020
Hi all,
Last week I had a longer chat with Thiago about how to evolve QString for Qt 6.
Some work has already happened, so both QString and QByteArray now share the data structure with QList/QVector, enabling zero-copy conversion between the types. There are also some pending changes to transition those classes to qsizetype and remove the 32-bit limitations we currently have.
My high-level goal for the string classes in Qt 6 was to complete the Unicode-related changes that we started for Qt 5.0, where we made utf8 and utf16 the main encodings, and to simplify things. I believe it’s important to leave the non-Unicode world behind us and offer as consistent a cross-platform story here as we can.
Qt 5.x still has some left-overs from the pre-unicode world:
* QTextStream encodes in Latin1 by default, so do a couple of classes in some places
* While we assume Utf8 as the source encoding for Qt, we still use QLatin1String all over the place
* We have extensive support for legacy text encodings in Qt Core, that should not be there anymore in 2020
* We offer options to generate HTML or XML in legacy encodings, even though the standard clearly says that those are deprecated
* to/fromLocal8Bit() should be equivalent to to/fromUtf8() on all but Windows (where we’re still a few years away from fully getting rid of this)
* source code encoding is undefined
Cleaning this up has progressed quite a bit, and a lot of changes in various classes have been merged. There’s a large set of changes currently being reviewed that remove QTextCodec as a dependency in Qt (it’ll get moved to libQt5Compat) and introduce a new QStringConverter class that can handle transcoding between Unicode encodings, Latin1 and the system locale. For all systems except Windows, we make the additional assumption that the system locale is UTF-8 (see also my other mail about UTF-8 as System locale on Windows).
A next step is to change the build system, so that it (by default) assumes that source code is encoded in UTF-8. We already set compiler flags to ensure this when building Qt itself, but are not doing this yet for user code.
But gcc and clang already treat all source code as UTF-8 by default (and I believe ICC does the same, at least on platforms other than Windows). MSVC requires the /utf-8 flag to enable this, something that I want to add to the default config for both qmake and cmake when compiling a Qt app. Without it, MSVC will still assume the source code is encoded in the current ANSI code page, and u"…" or u8"…" will result in garbage. Worse, it’ll lead to non-portable code that might compile correctly on one developer machine and create garbage on the next one (as that machine uses a different locale).
Changing this also for our users will make source code written for Qt more portable and bring Qt on par with most other programming languages in the world that already mandate utf8 as the source encoding (JS, Swift, Java, etc).
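To make the "garbage" concrete, here is a minimal sketch in plain standard C++ (no Qt) of what happens when the UTF-8 bytes of a non-ASCII character are read as a single-byte code page. The helper names decodeAsLatin1() and decodeAsUtf8() are made up for this illustration, and Latin-1 stands in for the ANSI code page (it coincides with Windows-1252 for the two bytes used here). It does not model what MSVC does internally; it only shows the effect of decoding the same bytes under two different assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one Latin-1 character, i.e. the
// result of assuming a single-byte code page for UTF-8 encoded input.
std::u16string decodeAsLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char byte : bytes)
        out.push_back(static_cast<char16_t>(byte));
    return out;
}

// Hypothetical helper: minimal UTF-8 decoder for 1- to 3-byte sequences.
// Illustration only: it does not validate continuation bytes and maps
// anything it cannot handle (including 4-byte sequences) to U+FFFD.
std::u16string decodeAsUtf8(const std::string &bytes)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < bytes.size()) {
            const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(bytes[i + 2]);
            out.push_back(static_cast<char16_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        } else {
            out.push_back(u'\uFFFD');
            i += 1;
        }
    }
    return out;
}

// The two bytes 0xC3 0xA4 are the UTF-8 encoding of U+00E4.
void demoSourceEncoding()
{
    const std::string utf8Bytes = "\xC3\xA4";
    assert(decodeAsUtf8(utf8Bytes).size() == 1);   // one code unit, as intended
    assert(decodeAsLatin1(utf8Bytes).size() == 2); // two unrelated characters
}
```

The same source bytes therefore produce a different UTF-16 string depending on which encoding is assumed for the file, which is why a fixed default matters.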
Our string handling classes currently consist of the following classes: QByteArray, QString, QStringView, QStringRef, QStringLiteral and QLatin1String. The set is too large, inconsistent and needs cleaning up:
* With the source code encoding being utf8, QLatin1String makes a lot less sense, and my goal is to deprecate/deprioritize it in Qt 6. Instead, I would like to advocate the use of u"…" to directly encode the string as utf-16.
* QStringRef has been superseded by QStringView and should get deprecated. The main hurdle here is its use in QXmlStream. The plan is to extend QXmlStringRef (yes, that one exists as well…) to cover the use case. Both QXmlStringRef and QStringRef will get a cast operator to QStringView. With that we can then remove all API that takes a QStringRef and replace it with API taking either a QString or a QStringView.
* QStringLiteral should turn into a small wrapper around u"…", and probably also get deprecated. Maybe we could add a user defined literal for it instead that returns a read-only QString (QString s = "…"_q;). So u"…" would lead to a QStringView, u"…"_q to a read-only QString.
* We should add a QByteArrayView to keep symmetry between the QString and QByteArray APIs. This is somewhat independent from the rest though and lower priority.
* QStringView and QByteArrayView need to be completed to implement all const methods of QString/QByteArray
* A basic difference between QString and QStringView will be that the view class can contain non-zero-terminated data and is read-only, while QString will guarantee zero termination (I checked whether we can remove that enforcement, but it would break too much code). Sidenote: currently, fromRawData() together with utf16() can break this assumption; we should fix this.
* QByteArray’s methods like toUpper() will only handle ASCII characters (they assume Latin1 in Qt5).
This would leave us with 4 string related classes: QByteArray(View) and QString(View).
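The u"…" versus u"…"_q idea from the list above can be sketched in plain standard C++. Everything here is a stand-in and an assumption: std::u16string_view plays the role of QStringView, std::u16string plays the role of QString, and the `_q` suffix is only the name floated in this mail, not an existing Qt API. Note one important difference: this sketch copies the literal into the owning string, whereas the read-only QString proposed here would reference the literal data without copying.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Stand-ins for illustration only (not Qt types).
using View = std::u16string_view; // plays QStringView
using String = std::u16string;    // plays QString

// Hypothetical literal operator: u"…"_q yields an owning string.
// Unlike the read-only QString discussed in the mail, this copies the data.
String operator""_q(const char16_t *text, std::size_t length)
{
    return String(text, length);
}

// A function taking a view accepts both kinds of argument.
std::size_t countCodeUnits(View text)
{
    return text.size();
}

void demoLiterals()
{
    View view = u"Qt 6";      // plain u"…": a non-owning view of the literal
    String owned = u"Qt 6"_q; // u"…"_q: an owning string
    assert(view == owned);
    assert(countCodeUnits(view) == countCodeUnits(owned));
}
```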
Another step that is already partially implemented is to allow read-only string data inside a QString with a null d-pointer. This will serve as a replacement for QStringLiteral, and allow us to pass read-only data without copying into all of our API. As opposed to QStringView, QString will however require zero termination.
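The zero-termination difference can be illustrated with the standard library, again with std::u16string standing in for QString and std::u16string_view for QStringView (an analogy, not Qt code). The helper prefixView() is a made-up name. A view into the middle of a buffer is followed by whatever comes next in that buffer, while an owning string is always followed by a terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: a view of the first `length` code units of `owner`.
// No data is copied, so `owner` must outlive the returned view.
std::u16string_view prefixView(const std::u16string &owner, std::size_t length)
{
    return std::u16string_view(owner).substr(0, length);
}

void demoTermination()
{
    const std::u16string owner = u"hello world";
    const std::u16string_view view = prefixView(owner, 5);
    assert(view == u"hello");

    // The owning string guarantees a terminator after its last code unit.
    assert(owner.c_str()[owner.size()] == u'\0');

    // The view does not: the unit after its end is the space from "hello world".
    // (Reading it is only valid here because it is still inside owner's buffer.)
    assert(view.data()[view.size()] == u' ');

    // Passing the view to an API that needs termination requires a copy.
    const std::u16string copy(view);
    assert(copy.c_str()[copy.size()] == u'\0');
}
```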
One open question is whether we should add a QUtf8String with a char8_t. I am not yet convinced that we actually need the class though.
The next question is what we do with our API methods. Currently we have many places where we have three to four overloads for the same methods (taking a QString, a QStringView, a QStringRef and a QLatin1String). We can’t have four overloads for each method in all of Qt, so we need to restrict overloads to the places where it is required. IMO this is mainly the string related classes themselves. And even there we can probably cut down on the number of overloads.
In most other places we should by default only use QString, unless there are very significant performance benefits to be had from using QStringView. This helps us keep an API that’s both easy to use and maintain. With the ideas above, you can still create a read-only string, so data copies can in many cases be avoided if required.
I believe the current multitude of overloads adds only API clutter with very little benefit to our users. We should cut it down and offer only one version. In most cases the API should simply take a QString (esp. as we can do read-only strings without allocation). QStringView should be used in low level APIs where performance matters and we are certain that we will never require a copy of the input string. The QStringRef and QL1String overloads should simply disappear.
So this would give the following API guidelines for using QString(View) and related classes in Qt:
For String related classes:
* All methods not taking ownership of the passed arguments take a QStringView
* If the method stores a pointer to the passed data it should take a QString to not surprise users. Exceptions can be made where it makes sense, but then the method naming has to give clear indications that this happens (like e.g. fromRawData())
* Return a QString in QString itself and when doing conversions, return QStringView from QStringView
* No QStringRef!
* QLatin1String for backwards compatibility, can be disabled with a macro (similar to QT_NO_CAST_FROM_ASCII)
* Remove or deprecate overloads taking a (const Char *, length) pair. Replace them with QStringView
Most other classes:
* Only take and return QString
* Exceptions can be done where significant performance gains can be demonstrated and the API will by design not require a copy of the data (e.g. XML writer, stream writers, date time handling)
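A rough sketch of the two guidelines above, using standard C++ stand-ins (std::u16string_view for QStringView, std::u16string for QString). The function startsWithScheme() and the class Label are invented for this example and do not correspond to any Qt API. The point is only the parameter choice: a function that merely inspects its argument takes a view, a method that keeps the data takes an owning string.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Does not keep its argument: a view parameter accepts literals, owning
// strings and sub-views alike, without copying.
bool startsWithScheme(std::u16string_view url)
{
    // substr() clamps the count, so short inputs are handled safely.
    return url.substr(0, 8) == u"https://";
}

// Keeps its argument beyond the call: takes an owning string, so the caller
// never has to worry about the lifetime of the data passed in.
class Label
{
public:
    void setText(std::u16string text) { m_text = std::move(text); }
    const std::u16string &text() const { return m_text; }

private:
    std::u16string m_text;
};

void demoGuidelines()
{
    const std::u16string url = u"https://www.qt.io";
    assert(startsWithScheme(url));          // owning string converts to a view
    assert(!startsWithScheme(u"ftp://x"));  // literal converts to a view

    Label label;
    label.setText(url);                     // stored as its own copy
    assert(label.text() == url);
}
```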
Implementation wise, here are the steps we need to take:
* Finish the QString/QStringView API symmetry. QStringView should offer the complete const API of QString. QString’s const methods are implemented in terms of QStringView (or a static method in a private namespace is used for both).
* (Lower priority) Do the same for QByteArray
* Remove all QStringRef uses except for QXmlStream (which is a beast in itself)
* Rework QXmlStreamReader. Most likely simply return a QXmlStringRef instead of QStringRef (and extend its API slightly + add cast operators to QStringView and QString). This is somewhat of a source-incompatible change (SIC), but fortunately XML parsing is usually limited to a very restricted part of any project
* Deprecate QStringRef
* Enable /utf-8 mode in MSVC for qmake and cmake builds (for us and our users) by default. Offer a simple way to turn it off for backwards compatibility
* Consider doing some of the things for utf-8 support on Windows outlined in https://lists.qt-project.org/pipermail/development/2020-May/039421.html
* Our QLatin1String uses are in most cases about pure ASCII strings. In any case, we should consider mass porting them over to u"…" instead.
* I don’t think we can deprecate QL1String and QStringLiteral in 6.0, but we should offer a mode to disable them.
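The first step in the list above (const methods of the string class implemented in terms of the view, or via a shared private function) could look roughly like this. It is a sketch under stated assumptions: Text and TextView are invented stand-in classes built on the standard library, detail::count() is a made-up shared helper, and none of this reflects the actual QString implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

namespace detail {
// Single shared implementation, written once against a view.
std::size_t count(std::u16string_view text, char16_t unit)
{
    std::size_t n = 0;
    for (char16_t c : text) {
        if (c == unit)
            ++n;
    }
    return n;
}
} // namespace detail

// Stand-in for the view class: offers the const API.
class TextView
{
public:
    TextView(std::u16string_view data) : m_data(data) {}
    std::size_t count(char16_t unit) const { return detail::count(m_data, unit); }

private:
    std::u16string_view m_data;
};

// Stand-in for the owning class: same const API, same implementation.
class Text
{
public:
    Text(std::u16string data) : m_data(std::move(data)) {}
    std::size_t count(char16_t unit) const { return detail::count(m_data, unit); }
    TextView view() const { return TextView(m_data); }

private:
    std::u16string m_data;
};

void demoSharedImplementation()
{
    const Text text(u"banana");
    assert(text.count(u'a') == 3);
    assert(text.view().count(u'a') == text.count(u'a'));
}
```

With this layout, adding a const method means writing it once in the shared helper and forwarding from both classes, which keeps the two APIs symmetric by construction.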
Comments are welcome, help to implement the plan even more :)
Cheers,
Lars