[Development] Updating x86 SIMD support in Qt

Thiago Macieira thiago.macieira at intel.com
Wed Jan 19 17:48:37 CET 2022


On Wednesday, 19 January 2022 00:13:32 PST Lars Knoll wrote:
> The main thing I’m wondering about is how much performance we gain from a
> multi-arch build of Qt for different x86_64 architectures as opposed to
> building maybe for v2 and detecting/using AVX and AVX512 at runtime. I would
> assume it’s little we gain, as there are very few places where the compilers'
> auto-vectorizer will be able to emit AVX instructions, but I might be wrong
> here.
 
> AVX is only used by a couple of classes in Qt Core and the drawhelper in Qt
> Gui. Qt Gui already does runtime detection, so it would be only about
> adding that to the methods in Qt Core. 

Hello Lars

That's a misconception. AVX and especially AVX2 introduce a lot of codegen 
opportunities for the compilers, which they've been able to use for years. The 
v3 level also introduces some lesser-known features like MOVBE and BMI, which 
are new scalar instructions, not SIMD. But even v2 brings in interesting 
instructions that map practically 1:1 to some C library functions, like 
ROUNDPS and ROUNDPD.
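
For illustration, here's the kind of loop I mean (a minimal sketch, not Qt 
code): with a v2 baseline (-march=x86-64-v2, or just -msse4.1), the compiler 
can vectorise it with ROUNDPS instead of emitting a libm call per element.

    #include <cmath>
    #include <cstddef>

    // With SSE4.1 available unconditionally, this whole loop can become a few
    // ROUNDPS instructions; without it, the compiler has to call floorf() or
    // emit a much longer scalar sequence.
    void floor_all(float *data, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            data[i] = std::floor(data[i]);
    }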

QtGui draw helpers do runtime detection, but QtCore does not. And this is the 
issue: each of those individual optimisations is small enough that the 
overhead of selecting at runtime is higher than the benefit. We're talking 
about very hot functions such as QString::fromLatin1 taking an (extra) 
indirect function call for everything. But the aggregate of those 
optimisations is worth it, for negligible cost in producing them: build the 
entire library twice and let the dynamic linker choose.
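
To make the overhead concrete, the runtime-selection approach boils down to 
something like this sketch (the names are hypothetical, not Qt's actual 
internals):

    #include <cstddef>

    static char16_t *fromLatin1_baseline(char16_t *dst, const char *src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = char16_t(static_cast<unsigned char>(src[i]));
        return dst + n;
    }

    // an AVX2 version would be installed here after a CPUID check at startup
    static char16_t *(*fromLatin1_impl)(char16_t *, const char *, std::size_t)
            = fromLatin1_baseline;

    char16_t *fromLatin1(char16_t *dst, const char *src, std::size_t n)
    {
        // every call pays this indirect jump, even for tiny strings
        return fromLatin1_impl(dst, src, n);
    }

For a conversion that is often only a handful of bytes, that indirect call 
and the lost inlining can cost more than the SIMD saves; building the whole 
library twice sidesteps the dispatch entirely.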

I do plan on looking into runtime detection in QtCore for 6.5, but I'm not 
convinced I can make it work with sufficiently low overhead on all platforms. 
I know I can for Linux with IFUNC support, but doing so for Linux only is not 
worth our collective time (development, review, long-term maintenance).
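
For the curious, the IFUNC mechanism looks roughly like this (GCC on 
glibc-based Linux; the function is made up, not actual QtCore code). The 
resolver runs once, when the dynamic linker binds the symbol, so the 
steady-state cost is just the usual PLT call:

    #include <cstddef>

    static std::size_t count_ascii_baseline(const char *s, std::size_t n)
    {
        std::size_t i = 0;
        while (i < n && static_cast<unsigned char>(s[i]) < 0x80)
            ++i;
        return i;
    }

    __attribute__((target("avx2")))
    static std::size_t count_ascii_avx2(const char *s, std::size_t n)
    {
        // identical source, compiled for AVX2, so it may get vectorised
        std::size_t i = 0;
        while (i < n && static_cast<unsigned char>(s[i]) < 0x80)
            ++i;
        return i;
    }

    // the resolver runs when the symbol is bound, possibly before any
    // constructors, hence the explicit __builtin_cpu_init()
    extern "C" std::size_t (*resolve_count_ascii())(const char *, std::size_t)
    {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? count_ascii_avx2
                                              : count_ascii_baseline;
    }

    // callers see one symbol; which body it points to is decided exactly once
    extern "C" std::size_t count_ascii(const char *, std::size_t)
            __attribute__((ifunc("resolve_count_ascii")));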

For the QtGui runtime detection, we have a ticking time bomb. See below.
 
> > I propose we remove the tests for the intrinsics of each individual CPU 
> > feature. Instead, let's just assume they all have everything up to 2016.
> > This  will shorten cmake time a little and fix the macOS universal
> > builds. It'll also change how 32-bit non-SSE2 builds are selected (see
> > below).
> > 
> > The change https://codereview.qt-project.org/c/qt/qtbase/+/386738 is going
> > in  this direction but retains a test (all or nothing). I'm proposing
> > now we remove the test completely and just assume.
> 
> 
> I’m fine with that, I don’t think we need to support a compiler that doesn’t
> support those. 

As I said, for the record: all compilers we support do support all of them, 
based on the CI runs of those changes. The only issue is that the QCC compiler 
lacks the RDSEED intrinsics, but that's easily worked around (it does support 
inline assembly and it may support the low-level 
__builtin_ia32_rdseed_si_step() intrinsic).
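
For reference, the inline-assembly fallback would be something along these 
lines (a sketch, not the actual workaround we'd ship): RDSEED reports success 
in the carry flag, so the caller has to check it and possibly retry.

    #include <cstdint>

    // returns true if *out now contains a random value; RDSEED can fail
    // transiently, in which case the caller should retry or fall back
    static inline bool rdseed32(std::uint32_t *out)
    {
        unsigned char ok;
        asm volatile("rdseed %0\n\t"
                     "setc %1"
                     : "=r"(*out), "=qm"(ok)
                     :
                     : "cc");
        return ok;
    }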
 
> Can we do the same thing for NEON at the same time, btw? While there are
> some platforms that don’t support NEON, I believe all compilers do support
> it.

Neon is different.

First, it's never selected at runtime. The issue that affected the NEON code 
generation is actually present in x86 too, but we've been lucky to avoid it. 
This is the ticking time bomb I was talking about: whenever you compile C++ 
sources and the compiler decides not to inline an inlineable function, it will 
create a copy of it and call that. This may happen in multiple translation 
units, so the linker chooses any one of them to emit in the final binary 
(usually, it's the one from the first .o that offered it, but that's just 
implementation behaviour). 

So what happens if the copy came from the higher-target .o? Kaboom. This is 
what affected our Neon builds and this is why we don't detect it at runtime. 
I've spoken to GCC and GNU Binutils maintainers about this issue. There's 
little incentive for them to provide a workaround for this, so it won't 
happen. The only solution they offer is to have different libraries or 
plugins.
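
To spell the failure mode out with a made-up example, suppose the same inline 
helper is visible in two translation units compiled with different 
instruction-set flags:

    // helper.h (included by both .cpp files below)
    inline int sum(const int *v, int n)
    {
        int total = 0;
        for (int i = 0; i < n; ++i)   // vectorised according to each TU's flags
            total += v[i];
        return total;
    }

    // generic.cpp   -- built with the baseline flags
    // avx2path.cpp  -- built with -mavx2, only ever reached after a CPUID check
    //
    // If the compiler emits an out-of-line copy of sum() in both objects, the
    // linker keeps just one of them. If it keeps the copy from avx2path.o, the
    // supposedly safe generic code path now runs AVX2 instructions and crashes
    // on CPUs without AVX2, even though it never "chose" that path.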

So, can we change how we detect Neon? Sure. We can simply assume it's there 
all the time, since it is there all the time anyway. I know the ARMv8 
architecture (read: 64-bit, AArch64) requires it, like 64-bit x86 requires 
SSE2, so that's inescapable. The question is therefore whether we want to 
always enable it on 32-bit ARMv7. I'd say it should get the same answer as 32-
bit i386: yes, enable it by default but allow disabling it.
 
> See my comment above. We also need to think about non Linux platforms.
> Multi-arch is difficult on Windows as far as I know, so a v2 baseline build
> and runtime detection might be preferable.

Indeed, this solution is specific to glibc-based Linux. It does not apply to 
other OSes because it's a solution predicated on the system's dynamic linker 
being able to select different files or different sections of a file based on 
CPU identification.
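
Concretely, with a sufficiently new glibc the on-disk layout ends up looking 
something like this (paths are illustrative); the dynamic linker probes the 
glibc-hwcaps subdirectory matching the running CPU before falling back to the 
plain file:

    /usr/lib64/libQt6Core.so.6                           <- baseline build
    /usr/lib64/glibc-hwcaps/x86-64-v3/libQt6Core.so.6    <- used on v3-capable CPUs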

If Microsoft wants their OS to have better-performing content, they'll have to 
come up with a solution. I do plan to reach out to them via the team that 
works with them at Intel, but I don't expect to see any solution, at least not 
before 2030. There are some workarounds for this (search for "delay-loaded 
DLL") but they've left me with a bad taste in my mouth.
 
> > 5) for glibc-based Linux, add v3 sub-arch by default
> > 
> > I'd like to raise the default on Linux from baseline to v2 *and* add a v3
> > sub-arch build, as described by point #3 above.
> > 
> > Device-specific Qt builds (Yocto Project, Boot2Qt) would need to turn this
> > off  and select a single architecture, if they don't want the extra
> > files.
> 
> This complicates the build system and deployment in quite a few places and
> is a Linux specific solution. Can you give some numbers how much of an
> improvement this would give over runtime detection where we have AVX
> optimised code?

Open Phoronix.com and search for Clear Linux. Almost every single time we've 
won a benchmark, it was because Clear Linux defaults to:
* v2 as the minimum
* v3 for quite a lot of libraries (including qtbase and qt3d)
* v4 for a few, especially libm

I don't expect this solution to complicate the build THAT much. I expect it's 
basically a matter of making the qt_internal_add_module() CMake function 
create two targets instead of one, based on some opt-in flag we set for the 
handful of libraries where we think there's value in doing this. Then it 
builds the same sources twice and installs two sets of binaries and their 
symlinks.

Clear Linux attempts to use a heuristic to guess which libraries are worth 
keeping the AVX2 version of. To see which ones it selected for qtbase, see
https://github.com/clearlinux-pkgs/qtbase/blob/
e16f08be736d28351219b05e807a6468ea39341b/qtbase.spec#L5771-L5902

For my OpenSUSE desktop, I didn't go nearly as far. I simply built QtCore and 
QtGui twice. See lines 938-944 of
https://build.opensuse.org/package/view_file/
home:thiagomacieira:branches:openSUSE:Factory/libqt5-qtbase/libqt5-
qtbase.spec?expand=1

And this is why I want to do this inside Qt itself and do it by default: 
because I've had to manually solve it twice for two different distros. It's 
not reasonable to expect every one of them to copy the solutions.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




