[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Lars Knoll lars.knoll at gmail.com
Tue Apr 18 09:46:26 CEST 2023



> On 17 Apr 2023, at 18:16, Thiago Macieira <thiago.macieira at intel.com> wrote:
> 
> On Monday, 20 March 2023 08:44:30 CDT Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>> 
>> [0]
>> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>> 
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>> 
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>> 
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
> 
> That only works for the file names, not the file contents and other channels. 
> For QProcess, we're slightly fortunate that we have UTF-16 API, so the 
> encoding that the other application uses for its command-line is irrelevant 
> for us.
> 
> But anything that goes through QIODevice::read or write (QProcess, QFile, 
> Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that 
> encoding is. Usually for sockets, the protocol is binary and obviates the 
> problem. For files, some file formats help. But in particular for communicating 
> with another process, there's no reliable way.

Communicating through a socket will always require that both sides agree on the encoding. That’s not really anything new. 

The question is how they encode the data when writing to the socket. If they use QTextStream, the data already gets written as UTF-8 by default today (since Qt 6.0). If they explicitly convert the QString to and from a specific encoding using QStringConverter/QTextCodec, nothing bad will happen either.

So the remaining problem comes when they use QString::to/fromLocal8Bit(), as the encoding behind those calls might change from some Windows code page to UTF-8. That is not a problem when two Qt apps communicate over a socket, but it might be an issue when storing data in a file or communicating with an app that doesn’t use Qt.

But we could consider that a user error, as you really shouldn’t use local8bit for anything other than stdin/stdout and interfacing with 8-bit system APIs.
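
To make the local8bit concern concrete, here is a rough sketch in plain Python rather than Qt. It only stands in for what QString::toLocal8Bit()/fromLocal8Bit() would do under two different locales. The choice of cp1252 as the legacy Windows code page and the sample string are assumptions for illustration.

```python
# Illustration only: plain Python standing in for locale-dependent
# 8-bit conversion. "cp1252" is an assumed example of a legacy
# Windows code page; the sample text is arbitrary.

text = "Grüße"

# Bytes a writer would produce under each locale.
legacy_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

# The same string yields different bytes depending on the locale.
bytes_differ = legacy_bytes != utf8_bytes

# Data stored under the legacy code page is not valid UTF-8, so a
# reader that now assumes UTF-8 fails to decode it.
try:
    legacy_bytes.decode("utf-8")
    legacy_readable_as_utf8 = True
except UnicodeDecodeError:
    legacy_readable_as_utf8 = False

# The other direction does not fail: a reader still assuming the
# legacy code page decodes UTF-8 bytes into the wrong characters.
mojibake = utf8_bytes.decode("cp1252")

# Naming the encoding explicitly on both sides is locale-independent.
round_trip = text.encode("utf-8").decode("utf-8")
```

The last line is the analogue of converting explicitly with a named encoding: it behaves the same whatever the locale is set to.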

> 
>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>> 
>> [2]
>> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
> 
> That was already known at the time, in 2019. What has changed is that the 
> Windows API has matured to the point that this is now a viable choice 
> (previously, it was experimental and known to cause issues). But it's still an 
> application choice; we can't enforce it.

We did enforce it on Unix systems with Qt 6, though. I do believe we can over time enforce it on Windows as well, or at least make it the default.
> 
>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
> 
> Sorry, no, we can't force users to do it because we don't know if their code 
> is safe.
> 
> But I think we should:
> a) do it for our own applications, since we do know our own code
> b) advise users somehow that they should opt-in to this
> c) decide if we want to change from opt-in to opt-out in the medium term (7.0 
>  for example)
> d) decide if we want to enforce it in the long-term
> 
> Option (d) depends on (c). Option (c) informs whether we need a Qt CMake API 
> or whether we can simply say upstream CMake should handle it.

I think this should be the goal, but I’d vote for a slightly faster schedule. 

(a) and (b) are things we should be able to do right now. All our apps work fine on Unix systems with a UTF-8 locale, so there should be relatively few problems doing the switch on Windows. The only thing this requires is a bit of CMake infrastructure work (that I believe has been mostly done already), and some documentation for our users.

(c) is something we should also announce with a time schedule right now. I would do this for either 6.8 or 6.9 (i.e. with the next LTS release or directly afterwards). If we announce it now, it gives our users 1.5 to 2 years to adapt (and they can always opt out afterwards).

(d) is something I would do for Qt 7, as that’s the correct time to make those changes and clean up our code base.

Cheers,
Lars




