[Development] How qAsConst and qExchange lead to qNN

Wed Nov 16 08:28:46 CET 2022

Hi Thiago,

On 15.11.22 17:25, Thiago Macieira wrote:
> On Tuesday, 15 November 2022 00:52:24 PST Marc Mutz via Development wrote:
>> That remains to be proven. A rule of thumb for atomics is that they're
>> two orders of magnitude slower than a normal int. They also still act as
>> optimizer firewalls. With that rule of thumb, copying 50 char16_t's is
>> faster than one ref-count update. What really is the deciding point is
>> whether or not there's a memory allocation involved. I mentioned that
>> for many use-cases, therefore, a non-CoW SBO container is preferable over a
>> CoW non-SBO one.
> 
> That's irrelevant so long as we don't have SBO containers.

We have the not-quite-SBO-but-close QVarLengthArray (QVLA) for arbitrary
types, and we have std::string as a stand-in for QByteArray and
std::u16string as a stand-in for QString. There are also tons of such
container classes in 3rd-party libraries, which suddenly become viable
because we've decoupled the API from the implementation.
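
To make the SBO point concrete, here is a minimal sketch that checks
whether a string's characters live inside the string object itself, in
which case copying it needs neither a heap allocation nor a ref-count
update. storedInline() and checkExample() are hypothetical helpers made
up for this illustration, and the inline capacity is an implementation
detail of the standard library, not something the C++ standard promises.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Qt or the standard library):
// returns true if the character buffer lies within the string object,
// i.e. the small-buffer optimisation is in effect for this value.
template <typename String>
bool storedInline(const String &s)
{
    const auto objBegin = reinterpret_cast<std::uintptr_t>(&s);
    const auto objEnd   = objBegin + sizeof(String);
    const auto data     = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= objBegin && data < objEnd;
}

// Usage example. The first expectation is an assumption about the
// standard library in use; it is not guaranteed by the C++ standard.
inline void checkExample()
{
    const std::u16string small = u"short";   // 5 code units
    const std::u16string large(100, u'x');   // 100 code units

    assert(storedInline(small));    // expected to fit the inline buffer
    assert(!storedInline(large));   // too big for any inline buffer

    const std::u16string copy = small;       // deep copy, expected to stay inline
    assert(storedInline(copy));
    assert(copy.data() != small.data());     // no sharing, so no ref-count
}
```

As far as I know, the three major standard library implementations all
keep a five-character std::u16string inline, but the exact threshold
differs between them.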

> So what we need to really compare are memory allocations versus the atomics. A
> locked operation on a cacheline on x86 will take in the order of 20 cycles of
> latency on top of any memory delays[1], but do note the CPU keeps running
> meanwhile (read: an atomic inc has a much smaller impact than an atomic dec
> that uses the result). A memory allocation for a single byte will have an
> impact bigger than this, hundreds of cycles.
> 
> Therefore, in the case of CoW versus deep copy, CoW always wins.
> 
> [1] https://uops.info/html-instr/INC_LOCK_M32.html says 23 cycles on an 11-
> year-old Sandy Bridge, 19 on Haswell, 18 on everything since Skylake.

In the problematic case, where a temporary container is created at the
call site for the sole purpose of providing function arguments, it's
specifically the dec that is the problem, though (deref-to-zero has an
acquire fence). The compiler cannot prove that the atomic hasn't been
manipulated, so it can't optimize the deref out and go directly to
deallocation. This includes the case of a defaulted extra argument
(https://bugreports.qt.io/browse/QTBUG-98117); maybe not with QString
anymore, but most recently with QKeySequence (which prompted the
addActions() revamp).
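
A minimal sketch of that pattern, assuming a drastically simplified
implicitly shared class: CowHandle, SharedData, consume() and
callWithTemporary() are hypothetical names for this illustration and do
not reflect the actual Qt implementation. The memory orderings are a
simplification, too. The point is only that the temporary's destructor
has to perform an atomic read-modify-write and inspect its result before
it may free the data, even though the temporary never escaped.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Shared payload with an atomic reference count (hypothetical).
struct SharedData {
    std::atomic<int> ref{1};
    std::size_t size = 0;
};

// Hypothetical stand-in for an implicitly shared (CoW) container.
class CowHandle {
public:
    explicit CowHandle(std::size_t n) : d(new SharedData) { d->size = n; }

    CowHandle(const CowHandle &other) : d(other.d)
    {
        // Taking a reference: the result of the increment is not needed.
        d->ref.fetch_add(1, std::memory_order_relaxed);
    }
    CowHandle &operator=(const CowHandle &) = delete;

    ~CowHandle()
    {
        ++derefCount; // instrumentation for the example only
        // Dropping a reference: the result *is* needed, because it
        // decides whether this was the last owner.
        if (d->ref.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete d;
    }

    std::size_t size() const { return d->size; }

    static int derefCount; // number of atomic decrements performed

private:
    SharedData *d;
};
int CowHandle::derefCount = 0;

// The callee is opaque to the caller in real code.
std::size_t consume(const CowHandle &h) { return h.size(); }

// A temporary created solely to provide the argument: one allocation,
// one atomic decrement and one deallocation per call.
std::size_t callWithTemporary()
{
    return consume(CowHandle(3));
}
```

Calling callWithTemporary() once increases CowHandle::derefCount by
exactly one: that is the deref in question.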

Seeing as destruction of temporaries is sequenced before the end of the
full-expression, and seeing as atomics are synchronization points, the
C++ memory model says that these atomic decs _will_ hold up execution of
the next statements.
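
The sequencing part can be observed directly. In this small sketch
(Tracer, use(), eventLog() and run() are hypothetical names), the
temporary's destructor runs at the end of the full-expression that
created it, before the following statement starts. It only demonstrates
the ordering, not the cost of an atomic operation in the destructor.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Global event log used to record the order of events.
std::vector<std::string> &eventLog()
{
    static std::vector<std::string> log;
    return log;
}

// Hypothetical type that records its construction and destruction.
struct Tracer {
    Tracer()  { eventLog().push_back("ctor"); }
    ~Tracer() { eventLog().push_back("dtor"); }
};

void use(const Tracer &) { eventLog().push_back("use"); }

// Returns the recorded order of events.
std::vector<std::string> run()
{
    eventLog().clear();
    use(Tracer{});                           // temporary dies at the ';'
    eventLog().push_back("next statement");  // runs only after the dtor
    return eventLog();
}
```

run() returns {"ctor", "use", "dtor", "next statement"}.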

Thanks,
Marc

-- 
Marc Mutz <marc.mutz at qt.io>
Principal Software Engineer
