[Development] Wishes for C++ standard or compilers
Thiago Macieira
thiago.macieira at intel.com
Thu Mar 30 00:59:29 CEST 2017
On Wednesday, 29 March 2017 08:44:35 PDT Thiago Macieira wrote:
> Without destructive moves:
> 1) allocate new block
> 2) move each element (constructor is noexcept, good)
> -> could be a memcpy + memset, but isn't...
> 3) destroy each element (refcount is checked every time!)
> 4) deallocate the old block
>
> With destructive moves:
> 1) allocate new block
> 2) destructive-move every element (a memcpy, whether explicit or not)
> 3) deallocate the old block
>
> With trivial destructive moves:
> 1) realloc
>
> Do we need to actually benchmark this?
Ok, so I have. This is of course a microbenchmark.
Source code: http://paste.opensuse.org/83426813
[Requires my branch's QArrayData]
Without further ado, results. Analysis and conclusions below.
GCC 6.3:                CPU cycles  % of move
copyTrivial        7,015,001.04688       105%
moveTrivial        6,653,361.60938       100%
memcpyTrivial      5,556,582.67969        84%
reallocTrivial     3,209,727.01953        48%
copyComplex       50,938,402.71875        99%
moveComplex       51,306,523.56250       100%
memcpyComplex     40,689,058.50000        79%
reallocComplex    13,450,587.92188        26%

Clang trunk:            CPU cycles  % of move
copyTrivial        6,366,532.53906       101%
moveTrivial        6,284,109.14063       100%
memcpyTrivial      6,410,652.50000       102%
reallocTrivial     3,926,226.85547        62%
copyComplex       44,778,841.93750        84%
moveComplex       53,097,208.06250       100%
memcpyComplex     38,966,088.87500        73%
reallocComplex    12,701,627.09375        24%

ICC 17:                 CPU cycles  % of move
copyTrivial        7,656,487.46094       100%
moveTrivial        7,688,472.17969       100%
memcpyTrivial      8,346,315.19531       109%
reallocTrivial     4,885,068.30859        64%
copyComplex       68,326,384.37500        91%
moveComplex       75,145,256.50000       100%
memcpyComplex     52,546,942.75000        70%
reallocComplex    29,427,426.37500        39%
Analysis:
1) I ran the application under Valgrind and it leaks no memory
2) the application was compiled with exceptions disabled, as we're not testing
the exceptional code paths. This also simulates a complex type that has a
noexcept move constructor.
3) because this is using QArrayData allocation with GrowsForward,
reallocation does not happen on every append. In fact, to append
one million items, the allocate() function is called only 59 times.
4) the benchmark measures only appending to the vector. The freeing of the
vector's elements at the end is not included.
5) there are four times two tests:
    copy + destroy original
    move + destroy original
    memcpy
    pure realloc
each run for:
    a trivial type (int)
    a complex but relocatable type (QString)
6) the test is skewed towards realloc because there's no other memory
allocation, so realloc() is free to resize the memory block in-place, as it
sees fit. Tough luck for the other tests.
7) the copy and move operations for the trivial type are, as expected, within
the noise of each other. The 5% spike for GCC does not appear in all
runs (the first test suffers a little due to warming up and the need to obtain
memory from the OS).
8) the difference between copy/move and memcpy for the trivial type depends on
the compiler. I compiled using -O2, which means GCC did not use its
vectoriser, but Clang did (-march=native, so they used AVX2). Meanwhile, all
three called the libc memcpy function in the memcpy version, which has
vectorisation (in ICC's case, it called __intel_avx_rep_memcpy).
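
As a rough illustration of the four strategies listed in point 5, here is a
minimal sketch. It is not the benchmark source (that is at the link above and
needs the QArrayData branch). Handle is a hypothetical stand-in for a
refcounted, relocatable type like QString, and the allocate() and growBy*()
helpers are names made up for this example. The memcpy and realloc variants
assume the type is relocatable, which the C++ standard does not currently
guarantee for types that are not trivially copyable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <utility>

// Hypothetical stand-in for a refcounted, relocatable type like QString:
// the object is a single pointer to a shared counter.
struct Handle {
    int *ref;
    Handle() : ref(new int(1)) {}
    Handle(const Handle &o) noexcept : ref(o.ref) { ++*ref; }
    Handle(Handle &&o) noexcept : ref(o.ref) { o.ref = nullptr; }
    Handle &operator=(const Handle &) = delete;
    ~Handle() { if (ref && --*ref == 0) delete ref; } // checked on every destroy
};

// Raw, uninitialised storage for `capacity` objects (capacity must be > 0).
template <typename T> T *allocate(std::size_t capacity) {
    void *p = std::malloc(capacity * sizeof(T));
    if (!p) std::abort();
    return static_cast<T *>(p);
}

// copy + destroy original
template <typename T> T *growByCopy(T *old, std::size_t n, std::size_t cap) {
    T *fresh = allocate<T>(cap);
    for (std::size_t i = 0; i < n; ++i) new (fresh + i) T(old[i]);
    for (std::size_t i = 0; i < n; ++i) old[i].~T();
    std::free(old);
    return fresh;
}

// move + destroy original
template <typename T> T *growByMove(T *old, std::size_t n, std::size_t cap) {
    T *fresh = allocate<T>(cap);
    for (std::size_t i = 0; i < n; ++i) new (fresh + i) T(std::move(old[i]));
    for (std::size_t i = 0; i < n; ++i) old[i].~T();
    std::free(old);
    return fresh;
}

// memcpy: a destructive move, no destructor runs on the old block
template <typename T> T *growByMemcpy(T *old, std::size_t n, std::size_t cap) {
    T *fresh = allocate<T>(cap);
    std::memcpy(static_cast<void *>(fresh), static_cast<const void *>(old),
                n * sizeof(T));
    std::free(old);
    return fresh;
}

// pure realloc: the allocator may resize in place and skip the copy entirely
template <typename T> T *growByRealloc(T *old, std::size_t, std::size_t cap) {
    void *p = std::realloc(static_cast<void *>(old), cap * sizeof(T));
    if (!p) std::abort();
    return static_cast<T *>(p);
}

// Builds n Handles (n > 0), grows the block once with each strategy, and
// checks that every element still owns its counter exactly once.
inline bool strategiesPreserveElements(std::size_t n) {
    Handle *v = allocate<Handle>(n);
    for (std::size_t i = 0; i < n; ++i) new (v + i) Handle();
    v = growByCopy(v, n, 2 * n);
    v = growByMove(v, n, 4 * n);
    v = growByMemcpy(v, n, 8 * n);
    v = growByRealloc(v, n, 16 * n);
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) ok = ok && v[i].ref && *v[i].ref == 1;
    for (std::size_t i = 0; i < n; ++i) v[i].~Handle();
    std::free(v);
    return ok;
}
```

Calling strategiesPreserveElements(1000) from a main() exercises all four
paths on the same block. The copy and move versions touch every element
twice (construct, then destroy), while the memcpy and realloc versions never
run a destructor on the old block.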
Conclusions:
A) for trivial types, the relocation provides negligible, if any, improvement,
as copying a trivial type already amounts to a memcpy. But as shown, not all
compilers are equally smart, and some miss optimisation opportunities when the
relocation information is absent. The more information the compiler has, the
better it can generate code.
B) for non-trivial types, the relocation provides 25% or more performance
improvement (1 / 80% = 125%). That's probably because QString's destructor is
actually non-negligible, as it needs to check if the data block it points to
needs to be freed.
C) the realloc strategy is clearly the winner here, achieving up to 316%
improvement over today's state-of-the-art. As stated in the analysis, this
happens with only 59 allocations out of 1 million elements appended. Now,
since nothing else was allocated in this benchmark, realloc() most likely did
not have to perform memcpy, so these numbers are not representative of all
applications, but they are possible and can even be surpassed with dedicated
arena allocators.
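
For reference, the improvement figures in B and C come from inverting the
"% of move" column. A small sketch of that arithmetic follows; the function
names are made up for this example.

```cpp
// Time relative to the move baseline, computed from two raw cycle counts.
constexpr double relativeTime(double cycles, double baselineCycles) {
    return cycles / baselineCycles;
}

// Converts a relative time (0.80 for "80% of move") into the improvement in
// percent: 1 / 0.80 = 1.25, that is, 25% more work done in the same time.
constexpr double improvementPercent(double relative) {
    return (1.0 / relative - 1.0) * 100.0;
}
```

improvementPercent(0.80) gives roughly 25, matching conclusion B, and
improvementPercent(0.24) gives a value between 316 and 317, which is where
the figure in conclusion C comes from (Clang's reallocComplex row). Feeding
the unrounded cycle counts through relativeTime() first yields a slightly
higher number, since 24% is itself rounded.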
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center