[Development] How qAsConst and qExchange lead to qNN

Thiago Macieira thiago.macieira at intel.com
Thu Nov 17 19:56:22 CET 2022


On Thursday, 17 November 2022 10:24:35 PST Volker Hilsheimer via Development 
wrote:
> > Though I am postponing the QString vectorisation update to 6.6 because I
> > don't have time to collect the benchmarks to prove I'm right before
> > feature freeze next Friday.
> 
> Next Friday is the platform & module freeze. Feature freeze is not until
> December 9th, i.e. another 3 weeks to go.

Next Friday is also the day after Thanksgiving here in the US.

I don't expect I can finish the benchmarking in 3 weeks, especially considering
I still need to finish the IPC work, which includes starting a couple of
changes I haven't begun yet (such as the ability to clean up after itself).

For the benchmarking, I've already collected the data by instrumenting each of 
the functions in question and running a Qt build, a Qt Creator start and a Qt 
build inside Qt Creator:

qt-build-data.tar.xz: 1197.3 MB
qtcreator-nosession.tar.xz: 2690.0 MB
qtcreator-session.tar.xz: 35134.6 MB

The data retains its intra-cacheline alignment.
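For the curious: the instrumentation boils down to dumping each call's input
together with its offset inside a 64-byte cache line, so the workload can be
replayed later with the original alignment. A rough sketch of the idea (the
names and the dump format here are hypothetical, not what the actual patch
does):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical recorder: append the intra-cacheline offset (0..63), the
// length in UTF-16 code units, then the raw bytes to a trace file.
static void recordInput(const char16_t *str, size_t len)
{
    static FILE *log = std::fopen("qustr-trace.bin", "ab");
    if (!log)
        return;
    const uint64_t header[2] = {
        uint64_t(reinterpret_cast<uintptr_t>(str) & 63), // cacheline offset
        uint64_t(len)                                    // UTF-16 code units
    };
    std::fwrite(header, sizeof(header), 1, log);
    std::fwrite(str, sizeof(char16_t), len, log);
}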

The way I see it, for each of the algorithm generations I need to:
1) find the asymptotic limits, given L1, L2 and L3 cache sizes
That is, the algorithms should be fast enough that the bottleneck is the 
transfer of data. There's no way that running qustrchr on 35 GB is going to be 
bound by anything other than RAM bandwidth or, in my laptop's case, the NVMe. 
So what are those limits?

2) benchmark at several data set sizes (half to 75% of L1, half to 75% of L2) 
on several generations
Confirm that each algorithm runs close to or better than the ideal that
LLVM-MCA showed when I designed them. I know I can benchmark throughput to see
whether we're reaching the target bytes-per-cycle processing rate (see the
sketch right after this list), but I don't know if I can benchmark the
latency. I also don't know if it matters.

3) benchmark at several input sizes (i.e., strings of 4 characters, 8 
characters, etc.)
Same as #2, but instead of running over a sample that adds up to a certain
data size, select the input so that the strings always have the same size.

4) compare to the previous generation's algorithm to confirm the new one is 
actually better
Different instructions have different pros and cons; what works for one
algorithm at a given data size may not work for another.
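
As for the throughput side of #2 and #3, the measurement loop itself is
simple. Here's a minimal sketch of the kind of thing I mean, with a scalar
scan standing in for the real function under test and the TSC used as a rough
cycle proxy (perf counters would be more precise; sizes and iteration counts
are illustrative):

#include <x86intrin.h>
#include <cstdio>
#include <vector>

int main()
{
    constexpr size_t bufferBytes = 24 * 1024;   // ~75% of a 32 kB L1D
    constexpr int iterations = 100000;
    std::vector<char16_t> haystack(bufferBytes / sizeof(char16_t), u'a');

    volatile char16_t sink = 0;
    unsigned long long start = __rdtsc();
    for (int i = 0; i < iterations; ++i) {
        // stand-in for the real function under test (e.g. qustrchr)
        char16_t found = 0;
        for (char16_t c : haystack)
            found ^= c;
        sink = found;
    }
    unsigned long long ticks = __rdtsc() - start;

    double bytesPerTick = double(bufferBytes) * iterations / double(ticks);
    std::printf("%.2f bytes per TSC tick\n", bytesPerTick);
    return sink == 0;   // keep the compiler from optimizing the loop away
}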

The algorithms available are:
* baseline SSE2: no comparisons
* SSE 4.1: compare to baseline SSE2, current SSE 4.1
* AVX2: compare to new SSE 4.1, current AVX2
* AVX512 with 256-bit vectors ("Avx256"): compare to new AVX2

I plan on collecting data on 3 laptop processors (Haswell, Skylake and Tiger 
Lake) and 2 desktop processors (Coffee Lake and Skylake Extreme). The Skylake
should match the performance of almost all Skylake and Skylake-derived
processors since 2016; the Coffee Lake NUC has the same processor as my Mac
Mini; and the Tiger Lake should represent the performance of modern
processors. The Skylake Extreme and the Tiger Lake can also run the AVX512
code. I don't know whether the AVX512 code on Skylake will show a performance
gain or a loss, because despite using only 256 bits, it may need to power on
the OpMask registers. If it doesn't show a gain, I will adjust the feature
detection to apply only to Ice Lake and later.
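
If it comes to that, the gate would conceptually look something like this (a
hypothetical check using GCC/Clang's __builtin_cpu_supports, not Qt's actual
feature-detection code):

#include <cstdio>

// Hypothetical gate for the 256-bit AVX512 ("Avx256") path: require the
// AVX512 foundation plus the VL and BW extensions (needed for 256-bit masked
// byte/word operations), and optionally an Ice Lake-era feature such as
// AVX512VBMI2 as a proxy to exclude Skylake-server parts if the OpMask
// power-up cost turns out to make them slower there.
static bool canUseAvx256(bool requireIceLake)
{
    if (!__builtin_cpu_supports("avx512f")
            || !__builtin_cpu_supports("avx512vl")
            || !__builtin_cpu_supports("avx512bw"))
        return false;
    if (requireIceLake && !__builtin_cpu_supports("avx512vbmi2"))
        return false;
    return true;
}

int main()
{
    __builtin_cpu_init();   // make sure the CPU feature table is populated
    std::printf("Avx256 path available: %s\n",
                canUseAvx256(true) ? "yes" : "no");
    return 0;
}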

I have a new Alder Lake which would be nice to benchmark, to get the 
performance on both the Golden Cove P-core and the Gracemont E-core, but the 
thing runs Windows and the IT-mandated virus scans, so I will not bother.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering




