[Development] How qAsConst and qExchange lead to qNN
thiago.macieira at intel.com
Thu Nov 17 19:56:22 CET 2022
On Thursday, 17 November 2022 10:24:35 PST Volker Hilsheimer via Development wrote:
> > Though I am postponing the QString vectorisation update to 6.6 because I
> > don't have time to collect the benchmarks to prove I'm right before
> > feature freeze next Friday.
> Next Friday is the platform & module freeze. Feature freeze is not until
> December 9th, i.e. another 3 weeks to go.
Next Friday is also the day after Thanksgiving here in the US.
I don't expect I can finish the benchmarking in 3 weeks, especially considering
I need to finish the IPC work, which includes a couple of changes that I
haven't even started yet (like the ability to clean up after itself).
For the benchmarking, I've already collected the data by instrumenting each of
the functions in question and running a Qt build, a Qt Creator start and a Qt
build inside Qt Creator:
qt-build-data.tar.xz: 1197.3 MB
qtcreator-nosession.tar.xz: 2690.0 MB
qtcreator-session.tar.xz: 35134.6 MB
The data retains its intra-cacheline alignment.
The way I see it, for each of the algorithm generations I need to:
1) find the asymptotic limits, given L1, L2 and L3 cache sizes
That is, the algorithms should be fast enough that the bottleneck is the
transfer of data. There's no way that running qustrchr on 35 GB is going to be
bound by anything other than RAM bandwidth or, in my laptop's case, the NVMe.
So what are those limits?
2) benchmark at several data set sizes (half to 75% of L1, half to 75% of L2)
on several generations
Confirm that the algorithms are running close to or better than the ideal run
that LLVM-MCA showed when I designed them. I know I can benchmark throughput
to see if we're reaching the target bytes/cycle processing, but I don't know
if I can benchmark the latency. I also don't know if it matters.
3) benchmark at several input sizes (i.e., strings of 4 characters, 8
characters, etc.)
Same as #2, but instead of running over the sample that adds up to a certain
data size, select the input such that the strings always have the same size.
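The selection in step 3 amounts to bucketing the instrumented sample by string length so each run sees a single size. A sketch, where bucketByLength is a hypothetical helper rather than part of the actual harness:

```cpp
#include <map>
#include <string>
#include <vector>

// Group the collected sample strings by length, so a benchmark run can
// iterate over, say, only the 4-character or only the 8-character inputs.
std::map<size_t, std::vector<std::u16string>>
bucketByLength(const std::vector<std::u16string> &sample)
{
    std::map<size_t, std::vector<std::u16string>> buckets;
    for (const auto &s : sample)
        buckets[s.size()].push_back(s);
    return buckets;
}
```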
4) compare to the previous generation's algorithm to confirm it's actually an
improvement
Different instructions have different pros and cons; what works for one at a
given data size may not work for another.
The algorithms available are:
* baseline SSE2: no comparisons
* SSE 4.1: compare to baseline SSE2, current SSE 4.1
* AVX2: compare to new SSE 4.1, current AVX2
* AVX512 with 256-bit vectors ("Avx256"): compare to new AVX2
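The comparison plan above can be written down as a small matrix: each new generation is measured against the generation below it and, where one exists, the currently shipping implementation at the same ISA level. The names here are labels for this sketch, not actual Qt symbols:

```cpp
#include <string>
#include <vector>

struct Comparison {
    std::string candidate;
    std::vector<std::string> baselines;
};

// One row per generation; an empty baseline list means "nothing to compare".
static std::vector<Comparison> comparisonMatrix()
{
    return {
        { "new SSE2",    {} },
        { "new SSE 4.1", { "new SSE2", "current SSE 4.1" } },
        { "new AVX2",    { "new SSE 4.1", "current AVX2" } },
        { "Avx256",      { "new AVX2" } },
    };
}
```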
I plan on collecting data on 3 laptop processors (Haswell, Skylake and Tiger
Lake) and 2 desktop processors (Coffee Lake and Skylake Extreme). The Skylake
should match the performance of almost all Skylake derivatives shipped since
2016; the Coffee Lake NUC has the same processor as my Mac Mini; the Tiger Lake
should be representative of modern processors. The Skylake Extreme and the
Tiger Lake can run the AVX512 code too. I don't know if the AVX512 code on
Skylake will show a performance gain or a loss, because despite using only 256
bits, it may need to power on the OpMask registers. If it doesn't show a gain,
I will adjust the feature detection to apply only to Ice Lake and later.
I have a new Alder Lake which would be nice to benchmark, to get the
performance on both the Golden Cove P-core and the Gracemont E-core, but the
thing runs Windows and the IT-mandated virus scans, so I will not bother.
Thiago Macieira - thiago.macieira (AT) intel.com
Cloud Software Architect - Intel DCAI Cloud Engineering