[Development] Updating x86 SIMD support in Qt
Thiago Macieira
thiago.macieira at intel.com
Wed Jan 19 19:47:07 CET 2022
On Wednesday, 19 January 2022 09:28:40 PST Edward Welbourne wrote:
> Thiago Macieira (19 January 2022 17:48) replied:
> > That's a misconception. AVX and especially AVX2 introduce a lot of
> > codegen opportunities for the compilers, which they've been able to
> > use for years.
>
> Is the difference here:
> * We have code that overtly conditions on the availability of CPU
> features (for example in the places Lars mentioned) vs
> * The compiler can achieve some optimizations, on which we currently
> miss out, if we pass a relevant command-line option telling it to do
> so (or omit one telling it not to) ?
>
> (Hoping you'll educate me if I'm being dense.)
Hello Eddy
It's both.
The compilers can generate better code if the relevant flags are passed on the
command line or are built in. We build QtCore and QtGui (or is it all
libraries now in Qt 6?) with -O3, which enables the auto-vectoriser in GCC and
Clang. Raising the minimum targeted architecture gives them more options to
generate faster code.
But it's not just vector code. Compare
https://gcc.godbolt.org/z/h3WfWsGEz (x86-64 baseline)
with
https://gcc.godbolt.org/z/vPcKqbT7P (x86-64-v2, except for MSVC)
The floor() function in any good libc/libm will detect the CPU at runtime and
use the SSE4.1 rounding instruction where it is available. glibc does:
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/s_floor-sse4_1.S.html
The problems are:
* not all libc/libm are good. In particular, MinGW's support lacks those
optimisations (I've just checked). I haven't disassembled MSVC's Runtime to
find out what it does.
* some compilers may opt to not make a function call, like GCC did in the
example above. When running on a pre-SSE4 CPU, the GCC generated code is
actually better. However, since the overwhelming majority of CPUs this code
will run on do have SSE4, GCC's codegen is actually a pessimisation.
* even in the best case scenario (the other compilers), you still have a
function call, which on ELF platforms means going through the PLT. So
instead of a single instruction, we have a CALL, then an indirect JMP, then
that instruction. That's unnecessary overhead for 99.9% of all users.
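To make the floor() case concrete, here is a minimal sketch. The function name
roundDown is made up for illustration and is not Qt code. Built for baseline
x86-64, GCC and Clang have to either call libm or expand a longer sequence
inline; with SSE4.1 enabled (for example -msse4.1 or -march=x86-64-v2) they
can typically reduce the body to a single ROUNDSD.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, for illustration only.
// Baseline x86-64: a call to floor() or an inline multi-instruction expansion.
// With SSE4.1 enabled at build time: usually one ROUNDSD instruction.
double roundDown(double x)
{
    return std::floor(x);
}
```

Comparing the output of "g++ -O3 -S" with and without -msse4.1 should show
the difference, much like the two Compiler Explorer links above.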
This is not an isolated example. If I disassemble my QtCore, I see a lot of
other scalar instructions:
$ objdump -d libQt6Core.so | egrep -c '(movbe|sarx|shrx|shlx|[tl]zcnt|popcnt)'
638
Each of those saves only a cycle here and there, so it's not worth making a
runtime decision to use them: the check would cost more than the instruction
saves. Instead, they must be used opportunistically, wherever the compiler
finds a use for them. This would be especially beneficial for math-heavy
libraries like Qt3D.
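As a rough illustration of the kind of code that turns into those scalar
instructions, consider the sketch below. The function names are hypothetical,
and the __builtin_* calls are GCC/Clang builtins rather than anything
Qt-specific. Which instruction the compiler picks depends on the target flags:
with -mpopcnt (or -march=x86-64-v2) the first function can become a single
POPCNT, and with -mbmi the second can become TZCNT.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, for illustration only.

// Number of set bits in v. Without POPCNT enabled the compiler falls back to
// a longer bit-twiddling sequence or a helper call.
int countSetBits(std::uint64_t v)
{
    return __builtin_popcountll(v);
}

// Index of the lowest set bit. v must be non-zero: the builtin's result is
// undefined for 0.
int lowestSetBit(std::uint64_t v)
{
    return __builtin_ctzll(v);
}
```

Neither function is worth a runtime CPU check on its own, which is why the
gain only shows up when the whole library is built for the newer baseline.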
Then there's our optimised code. qstring.cpp has a lot of it and it's not
selected at runtime, for the same reason: the overhead of selecting is higher
than the benefit. Most of our strings are fairly small: a histogram of all
calls to those functions from a Qt Creator start shows they peak around 5-10
characters, then drop sharply with a long tail. This means those operations
suffer greatly from overhead and what matters most is latency, not throughput.
That's very different from image manipulation, in QtGui's drawhelpers: even a
small 16x16 image is 1024 bytes (at 4 bytes per pixel). So any overhead in
making a selection is quickly amortised there, but not so for strings.
Would it be worth it for some of those operations in qstring.cpp? Probably,
particularly after my last round of optimisations. I especially think so if I
could use GNU IFUNC support, which would mean all callers would jump directly
into one of the optimised functions, instead of calling a function that then
calls another. But optimising the entire library means we get more, at the
cost of some extra time building and some more files in your system. It's also
a generic solution, instead of targeting particular functions. So it should
be a win-win: better performance at lower maintenance cost.
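For reference, the "function that then calls another" pattern looks roughly
like the sketch below. All names are hypothetical and this is not how
qstring.cpp is written; the "AVX2" variant is only a stand-in that reuses the
generic loop so the example stays runnable. __builtin_cpu_supports is a
GCC/Clang builtin for x86. With GNU IFUNC, the selection step would instead
run once in a resolver at symbol binding time, and callers would jump straight
to the chosen implementation without the wrapper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; none of these are Qt functions.
using FindFn = std::ptrdiff_t (*)(const char16_t *, std::size_t, char16_t);

// Portable fallback: index of the first occurrence of c, or -1.
std::ptrdiff_t findCharGeneric(const char16_t *s, std::size_t n, char16_t c)
{
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] == c)
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}

// Stand-in for a vectorised variant. A real one would use AVX2 intrinsics.
std::ptrdiff_t findCharAvx2(const char16_t *s, std::size_t n, char16_t c)
{
    return findCharGeneric(s, n, c);
}

// Runs once and picks an implementation for the CPU we are running on.
FindFn selectFindChar()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        return findCharAvx2;
#endif
    return findCharGeneric;
}

// The wrapper every caller goes through: a direct call here, a check that the
// function-local static is initialised, then an indirect call. For a string
// of 5 to 10 characters that overhead can rival the work itself.
std::ptrdiff_t findChar(const char16_t *s, std::size_t n, char16_t c)
{
    static const FindFn impl = selectFindChar();
    return impl(s, n, c);
}
```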
(*) we've long since shortened it to 4 characters. And AVX512 has a trick that
allows us to use it even down to a single byte (see my outgoing changes in
Gerrit).
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering