[Development] Mutex future directions

Sat May 19 10:26:03 CEST 2012

On Saturday, 19 May 2012 00:36:56, Olivier Goffart wrote:
> On Friday 18 May 2012 23:25:47 Thiago Macieira wrote:
> > > Which non-negligible performance penalty are you talking about here?
> > 
> > The need to allocate memory and do the test-and-set loop for setting the
> > private.
> 
> There is no memory allocation, it is using the QFreeList.
> And this happens when we are about to do a context switch. So I believe the
> test-and-set loop should be negligible.

True. My point is that it has a higher penalty than the Linux futex 
implementation does and, at least in the Windows case, I think it's possible 
to avoid it.
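
To make the cost under discussion concrete, here is a minimal sketch of the
"take a private from a free list, then test-and-set it into the mutex" step on
the contended path. It is not the Qt implementation: WaitData, pool,
acquireSlot(), releaseSlot() and SketchMutex are hypothetical names standing in
for QMutexPrivate and QFreeList, and the real QMutex encodes more states in the
same word than this sketch does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for QMutexPrivate; the real class wraps the OS wait primitive.
struct WaitData {
    std::atomic<bool> inUse{false};
    int waiters = 0;
};

// Hypothetical stand-in for QFreeList: a fixed pool, so taking a slot never calls malloc.
constexpr std::size_t PoolSize = 8;
static WaitData pool[PoolSize];

static WaitData *acquireSlot()
{
    for (WaitData &slot : pool) {
        bool expected = false;
        if (slot.inUse.compare_exchange_strong(expected, true))
            return &slot;
    }
    return nullptr; // pool exhausted; a real free list would grow instead
}

static void releaseSlot(WaitData *slot)
{
    slot->waiters = 0;
    slot->inUse.store(false);
}

struct SketchMutex {
    std::atomic<WaitData *> d{nullptr};

    // Slow path only: make sure a WaitData is installed before blocking.
    WaitData *ensurePrivate()
    {
        WaitData *current = d.load();
        while (!current) {
            WaitData *fresh = acquireSlot();
            if (!fresh)
                return nullptr;
            // Try to publish our slot; on failure 'current' receives the winner's pointer.
            if (d.compare_exchange_strong(current, fresh))
                return fresh;
            releaseSlot(fresh); // another thread won the race; give our slot back
        }
        return current;
    }
};
```

The extra work compared with a futex is the slot acquisition plus the
compare-and-exchange; both happen only once a thread is about to block.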

> > Right, and users use the classes as intended... :-P
> 
> (QBasicMutex is an undocumented internal class)

Again, true, but even our own developers make mistakes. As exhibit A, take 
Q_GLOBAL_STATIC_WITH_INITIALIZER. See my email 
"Q_GLOBAL_STATIC_WITH_INITIALIZER is harmful; please stop using it".

I'm not sure there's a way out for QBasicMutex, not without making it non-POD.

> > Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU
> > resources. It will consume front-end decoding, one ALU port, the
> > retirement, not to mention increased code size. There's no free lunch.
> 
> Slower than a function call (which will likely spill registers)?  I don't
> believe so.

No, not slower. But there's a cost anyway, including a cost that isn't 
measured in CPU cycles -- the opportunity cost of debugging.

> (moreover, we are talking about debug builds, right?)

No. We're talking about release builds too.

> > > That is not relevant.
> > > Transactional memory requires a different set of primitives. QMutex has
> > > nothing to do with transactional memory.
> > 
> > Then you didn't read the manual. I'm going to ignore what you said until
> > you read the manual because transactional memory has everything to do with
> > QMutex.
> > 
> > Please, either believe me or read the manual.
> 
> Do you have a link to that manual?

http://software.intel.com/file/41604 (see chapter 8)

See also
http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

> Transactional memory is about detecting conflicts between transactions, and
> rolling back.
> Mutexes are about locking until the resource is free

And Intel TSX allows us to do both: it allows us to remove the lock and 
execute the same code optimistically, hoping there are no conflicts. If there 
are, it will "roll back" and take the lock; if there weren't, it will 
atomically commit the transaction when you unlock.

Our QMutex is optimised for the non-contended case. TSX will allow us to 
optimise even the contended case where there was no data conflict.

> Transactional memory could be used to simplify the code that allocates the
> QMutexPrivate.
> 
> But I doubt the hypothetical gain is even worth the function call.
> 
> Because remember the uncontended case is much more critical than the
> contended case. (There is no point in winning a dozen cycles if we are
> going to pay several hundred anyway in the context switch that follows.)
> 
> But a fast uncontended case is critical because it allows the use of
> mutexes in parts that might or might not be used with threads. Example:
> QObject. We need to lock a mutex while emitting a signal because it may be
> used in multiple threads. But we don't want to pay too much when it is used
> in a single thread (the common case).

That's a very good example of where we want QMutex to be transactional: we 
need the mutex because we need to be sure that no other thread is modifying 
the connection list while the signal activation is reading the lists. However, 
we know that it's an extremely rare case for that to happen. If we have TSX in 
the signal emission path, then two threads could read from the signal list 
simultaneously.

In other words, with TSX, QMutex becomes a QReadWriteLock with no extra data 
cost.

> > My suggestion was to avoid the QMutexPrivate allocation on Mac and
> > Windows, since they require just a bit of initialisation. Now, given what
> > you said about semaphore_t, we may not be able to do it on the Mac. But we
> > can try to apply the same optimisation for Windows -- the initialisation
> > is a call to CreateEvent.
> 
> But CreateEvent still probably allocates memory behind the scenes.

True, but will you let me at least try? 

If you have data saying that allocating a Windows event with CreateEvent for 
every single QMutex has a higher cost than the free list, please add that 
comment to qmutex_win.cpp.

If you don't have such data, then I will try this.

> > I disagree. If we shoot ourselves in the foot by not being able to support
> > TSX and valgrind in Qt 5, we've lost a lot. That's why I propose
> > de-inlining for 5.0, so we have enough time to investigate those potential
> > drawbacks.
> 
> I think the inline is not a problem for valgrind.

Then please add *now* the necessary valgrind macros and prove to me that it 
works.

Note that proving that it works is a necessary condition for keeping them 
inline. It's not a sufficient condition.

> And I don't think we can gain much with TSX.

I think differently. Since neither of us has Haswell hardware yet to try, then 
we simply have two opinions with no way to be sure who's right.

Except that, if I'm right, then we may have shot ourselves in the foot. I'm 
simply asking that we point the gun away from our feet right now, until we find 
out more.

> Is it not, however, shooting ourselves in the foot not to inline it? Because
> we can hardly inline it later in a binary-compatible way.

Actually, I think I wasn't clear here. I want to leave the lock() and unlock() 
functions inline. But I'd like to change them so that they call lockInternal() 
and unlockInternal() directly in 5.0, without the compare-and-exchanges.

If in 5.1 we decide that it was fine to inline the locking and unlocking code, 
then we can simply go back. The internal functions don't change because of 
that: they need to deal with the possibility of the mutex having been unlocked 
behind their backs anyway.
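
As a rough illustration of the two shapes, here is a minimal sketch assuming a
mutex whose whole state is one atomic integer. SketchBasicMutex, lockFast() and
unlockFast() are hypothetical names, and the yield loop is only a placeholder
for a real futex or event wait; this is not the QBasicMutex code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SketchBasicMutex {
public:
    // Variant A: inline compare-and-exchange, calling out only under contention.
    void lockFast()
    {
        int expected = Unlocked;
        if (!state.compare_exchange_strong(expected, Locked))
            lockInternal();
    }
    void unlockFast()
    {
        int expected = Locked;
        // With only two states this always succeeds; a real mutex would fail
        // here when waiters are registered and then take the slow path.
        if (!state.compare_exchange_strong(expected, Unlocked))
            unlockInternal();
    }

    // Variant B (the 5.0 proposal): still inline wrappers, but always call out.
    void lock() { lockInternal(); }
    void unlock() { unlockInternal(); }

    bool isLocked() const { return state.load() != Unlocked; }

private:
    enum { Unlocked = 0, Locked = 1 };
    void lockInternal();   // out-of-line, so the library can change it later
    void unlockInternal();
    std::atomic<int> state{0};
};

void SketchBasicMutex::lockInternal()
{
    int expected = Unlocked;
    // The mutex may have been unlocked or relocked by another thread since the
    // caller last looked, so the state is re-checked on every iteration.
    while (!state.compare_exchange_weak(expected, Locked)) {
        expected = Unlocked;
        std::this_thread::yield(); // placeholder for a futex or event wait
    }
}

void SketchBasicMutex::unlockInternal()
{
    state.store(Unlocked);
}
```

Because lockInternal() and unlockInternal() never assume anything about the
state on entry, code compiled against either variant can be mixed on the same
mutex, which is what would allow going back to the inline fast path later.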

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden