[Development] Updating x86 SIMD support in Qt

Thiago Macieira thiago.macieira at intel.com
Wed Jan 19 19:47:07 CET 2022


On Wednesday, 19 January 2022 09:28:40 PST Edward Welbourne wrote:
> Thiago Macieira (19 January 2022 17:48) replied:
> > That's a misconception. AVX and especially AVX2 introduce a lot of
> > codegen opportunities for the compilers, which they've been able to
> > use for years.
> 
> Is the difference here:
> * We have code that overtly conditions on the availability of CPU
>   features (for example in the places Lars mentioned) vs
> * The compiler can achieve some optimizations, on which we currently
>   miss out, if we pass a relevant command-line option telling it to do
>   so (or omit one telling it not to) ?
> 
> (Hoping you'll educate me if I'm being dense.)

Hello Eddy

It's both.

The compilers can generate better code if the relevant flags are passed on the 
command-line or are built-in. We build QtCore and QtGui (or maybe it's all 
libraries now in Qt 6?) with -O3, which enables the auto-vectoriser in GCC and 
Clang. Raising the minimum targeted architecture gives them more options to 
generate faster code.

But it's not just vector code. Compare
 https://gcc.godbolt.org/z/h3WfWsGEz (x86-64 baseline)
with
 https://gcc.godbolt.org/z/vPcKqbT7P (x86-64-v2, except for MSVC)

The floor() function in any good libc/libm will have the runtime detection and 
use the instruction to implement the functionality. glibc does:
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/s_floor-sse4_1.S.html

The problems are:
* not all libc/libm are good. In particular, MinGW's support lacks those 
  optimisations (I've just checked). I haven't disassembled MSVC's Runtime to 
  find out what it does.

* some compilers may opt not to make a function call, as GCC did in the 
  example above. When running on a pre-SSE4 CPU, the GCC-generated code is 
  actually better. However, since the overwhelming majority of CPUs this code 
  will run on do have SSE4, GCC's codegen is actually a pessimisation.

* even in the best case scenario (the other compilers), you still have a 
  function call, which on ELF platforms means going through the PLT. So 
  instead of a single instruction, we have a CALL, then an indirect JMP, then 
  that instruction. That's unnecessary overhead for 99.9% of all users.

This is not an isolated example. If I disassemble my QtCore, I see a lot of 
other scalar instructions:
$ objdump -d libQt6Core.so | egrep -c '(movbe|sarx|shrx|shlx|[tl]zcnt|popcnt)'
638

Each of those saves a cycle here and there, so it's not worth making a runtime 
decision to use them. Instead, they must be used opportunistically. This would 
be especially beneficial for math-heavy libraries like Qt3D.


Then there's our optimised code. qstring.cpp has a lot of it and it's not 
selected at runtime, for the same reason: the overhead of selecting is higher 
than the benefit. Most of our strings are fairly small: a histogram of all 
calls to those functions from a Qt Creator start shows they peak around 5-10 
characters, then drop sharply with a long tail. This means those operations 
suffer greatly from overhead and what matters most is latency, not throughput. 
That's very different from image manipulation, in QtGui's drawhelpers: even a 
small 16x16 image is 1024 bytes. So any overhead in making a selection is 
quickly amortised there, but not so for strings.

Would it be worth it for some of those operations in qstring.cpp? Probably, 
particularly after my last round of optimisations. I especially think so if I 
could use GNU IFUNC support, which would mean all callers would jump directly 
into one of the optimised functions, instead of calling a function that then 
calls another. But optimising the entire library means we get more, at the 
cost of some extra build time and a few more files on your system. It's also 
a generic solution, instead of targeting particular functions. So it should 
be a win-win: better performance at lower maintenance cost.

(*) we've long-since shortened it to 4 characters. And AVX512 has a trick that 
allows us to use it even down to a single byte (see my outgoing changes in 
Gerrit).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




