[Development] Wishes for C++ standard or compilers

Thu Mar 30 00:59:29 CEST 2017

On Wednesday, 29 March 2017 08:44:35 PDT Thiago Macieira wrote:
> Without destructive moves:
>  1) allocate new block
>  2) move each element (constructor is noexcept, good)
>   -> could be a memcpy + memset, but isn't...
>  3) destroy each element (refcount is checked every time!)
>  4) deallocate the old block
> 
> With destructive moves:
>  1) allocate new block
>  2) destructive-move every element (a memcpy, whether explicit or not)
>  3) deallocate the old block
> 
> With trivial destructive moves:
>  1) realloc
> 
> Do we need to actually benchmark this?

Ok, so I have. This is of course a microbenchmark.

Source code: http://paste.opensuse.org/83426813
[Requires my branch's QArrayData]

Without further ado, results. Analysis and conclusions below.

GCC 6.3:               CPU cycles   % of move
copyTrivial       7,015,001.04688     105%
moveTrivial       6,653,361.60938     100%
memcpyTrivial     5,556,582.67969      84%
reallocTrivial    3,209,727.01953      48%
copyComplex      50,938,402.71875      99%
moveComplex      51,306,523.56250     100%
memcpyComplex    40,689,058.50000      79%
reallocComplex   13,450,587.92188      26%

Clang trunk:           CPU cycles   % of move
copyTrivial       6,366,532.53906     101%
moveTrivial       6,284,109.14063     100%
memcpyTrivial     6,410,652.50000     102%
reallocTrivial    3,926,226.85547      62%
copyComplex      44,778,841.93750      84%
moveComplex      53,097,208.06250     100%
memcpyComplex    38,966,088.87500      73%
reallocComplex   12,701,627.09375      24%

ICC 17:                CPU cycles   % of move
copyTrivial       7,656,487.46094     100%
moveTrivial       7,688,472.17969     100%
memcpyTrivial     8,346,315.19531     109%
reallocTrivial    4,885,068.30859      64%
copyComplex      68,326,384.37500      91%
moveComplex      75,145,256.50000     100%
memcpyComplex    52,546,942.75000      70%
reallocComplex   29,427,426.37500      39%

Analysis:

1) I've run the application under Valgrind and it leaks no memory

2) the application was compiled with exceptions disabled, as we're not testing 
the exceptional code paths. This also simulates a complex type that has a 
noexcept move constructor.

3) because this is using QArrayData allocation with GrowsForward, a 
reallocation does not happen on every append. In fact, to append one million 
items, the allocate() function is called only 59 times.

4) the benchmark measures only the appending into the vector. The freeing of 
the vector's elements at the end is not included.

5) there are four times two tests:
	copy + destroy original
	move + destroy original
	memcpy
	pure realloc
run each for
	a trivial type (int)
	a complex but relocatable type (QString)

6) the test is skewed towards realloc because there's no other memory 
allocation, so realloc() is free to resize the memory block in-place, as it 
sees fit. Tough luck for the other tests.

7) the copy and move operations for the trivial type are, as expected, within 
the noise of each other. The 5% spike for GCC does not appear in all runs 
(the first test suffers a little due to warming up and the need to obtain 
memory from the OS).

8) the difference between copy/move and memcpy for the trivial type depends on 
the compiler. I compiled using -O2, which means GCC did not use its 
vectoriser, but Clang did (-march=native, so they used AVX2). Meanwhile, all 
three called the libc memcpy function in the memcpy version, which has 
vectorisation (in ICC's case, it called __intel_avx_rep_memcpy).
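
For readers without the benchmark source at hand, the four strategies from 
point 5 can be sketched roughly as below. This is not the benchmark code: 
Shared is a hypothetical stand-in for a reference-counted type like QString, 
and allocate() and the growBy*() helpers are names made up for this sketch. 
The memcpy and realloc variants assume the type is relocatable; standard C++ 
does not bless that for a type with a non-trivial destructor, which is the 
point of the wish.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <utility>

// Hypothetical relocatable, reference-counted type (a stand-in for QString).
struct Shared {
    struct Block { int refs; int value; };
    Block *d;

    explicit Shared(int v) : d(new Block{1, v}) {}
    Shared(const Shared &o) : d(o.d) { if (d) ++d->refs; }
    Shared(Shared &&o) noexcept : d(o.d) { o.d = nullptr; }
    Shared &operator=(const Shared &) = delete;
    ~Shared() { if (d && --d->refs == 0) delete d; } // checked per element
};

// Hypothetical helper: raw, uninitialised storage for n elements.
inline Shared *allocate(std::size_t n)
{
    void *p = std::malloc(n * sizeof(Shared));
    if (!p) std::abort();
    return static_cast<Shared *>(p);
}

// copy + destroy original
Shared *growByCopy(Shared *old, std::size_t size, std::size_t newCap)
{
    Shared *fresh = allocate(newCap);
    for (std::size_t i = 0; i < size; ++i) new (fresh + i) Shared(old[i]);
    for (std::size_t i = 0; i < size; ++i) old[i].~Shared();
    std::free(old);
    return fresh;
}

// move + destroy original (the destructor still runs on every moved-from element)
Shared *growByMove(Shared *old, std::size_t size, std::size_t newCap)
{
    Shared *fresh = allocate(newCap);
    for (std::size_t i = 0; i < size; ++i) new (fresh + i) Shared(std::move(old[i]));
    for (std::size_t i = 0; i < size; ++i) old[i].~Shared();
    std::free(old);
    return fresh;
}

// memcpy: assumes relocatability, so no destructors run on the old block
Shared *growByMemcpy(Shared *old, std::size_t size, std::size_t newCap)
{
    Shared *fresh = allocate(newCap);
    std::memcpy(static_cast<void *>(fresh), static_cast<const void *>(old),
                size * sizeof(Shared));
    std::free(old);
    return fresh;
}

// pure realloc: assumes relocatability; the allocator may resize in place
Shared *growByRealloc(Shared *old, std::size_t, std::size_t newCap)
{
    void *p = std::realloc(static_cast<void *>(old), newCap * sizeof(Shared));
    if (!p) std::abort();
    return static_cast<Shared *>(p);
}
```

All four take the old block, the number of live elements and the new 
capacity, and return the new block; the caller remains responsible for 
destroying the live elements and freeing the block at the end.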

Conclusions:

A) for trivial types, relocation provides negligible improvement, if any, as 
the copy or move already amounts to the same operation. But as shown, not all 
compilers are equally smart, and without an explicit relocation they may miss 
optimisation opportunities. The more information the compiler has, the better 
the code it can generate.

B) for non-trivial types, the relocation provides 25% or more performance 
improvement (1 / 80% = 125%). That's probably because QString's destructor is 
actually non-negligible, as it needs to check if the data block it points to 
needs to be freed.

C) the realloc strategy is clearly the winner here, achieving up to 316% 
improvement over today's state-of-the-art. As stated in the analysis, this 
happens with only 59 allocations out of 1 million elements appended. Now, 
since nothing else was allocated in this benchmark, realloc() most likely did 
not have to perform memcpy, so these numbers are not representative of all 
applications, but they are possible and can even be surpassed with dedicated 
arena allocators.
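
The percentages in B and C follow from the rounded "% of move" column: a 
strategy that takes a fraction t of the baseline time is 1/t times as fast. A 
small sketch of that arithmetic, where improvementPercent is a name made up 
for illustration:

```cpp
// relativeTime: a strategy's time as a fraction of the move baseline,
// for example 0.80 for "80% of move".
// Returns how much faster the strategy is than the baseline, in percent.
double improvementPercent(double relativeTime)
{
    return (1.0 / relativeTime - 1.0) * 100.0;
}
```

With 0.80 this gives 25%, and with the 24% measured for Clang's 
reallocComplex it gives a little over 316%.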

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center