[Development] Mutex future directions

Sat May 19 18:34:30 CEST 2012

Hi,

Regarding valgrind:
 *) In a debug build, nothing is inlined.
 *) If we keep it inline, then we would just need a patch like this [1].
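
For illustration only (this is not the content of the patch at [1]), here is a
minimal sketch of what annotating an inline lock for valgrind could look like.
It assumes the Helgrind client-request macros from <valgrind/helgrind.h> and
falls back to no-ops when that header is not installed. AnnotatedMutex and
annotatedMutexDemo are hypothetical names, not Qt API, and the spin loop stands
in for the real slow path.

```cpp
#include <atomic>
#include <cassert>

// Use the real Helgrind macros when the header is available.
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#    define SKETCH_HAVE_HELGRIND 1
#  endif
#endif
#ifndef SKETCH_HAVE_HELGRIND
// No valgrind headers: the annotations compile to nothing.
#  define VALGRIND_HG_MUTEX_LOCK_PRE(m, isTryLock) do { } while (0)
#  define VALGRIND_HG_MUTEX_LOCK_POST(m)           do { } while (0)
#  define VALGRIND_HG_MUTEX_UNLOCK_PRE(m)          do { } while (0)
#  define VALGRIND_HG_MUTEX_UNLOCK_POST(m)         do { } while (0)
#endif

// Hypothetical stand-in for an inline mutex; not the real QMutex.
class AnnotatedMutex {
public:
    inline void lock() {
        VALGRIND_HG_MUTEX_LOCK_PRE(this, 0);   // 0: this is not a try-lock
        bool expected = false;
        // Spin until the flag flips from false to true (sketch only).
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false;
        }
        VALGRIND_HG_MUTEX_LOCK_POST(this);
    }
    inline void unlock() {
        VALGRIND_HG_MUTEX_UNLOCK_PRE(this);
        locked.store(false, std::memory_order_release);
        VALGRIND_HG_MUTEX_UNLOCK_POST(this);
    }
    bool isLocked() const { return locked.load(std::memory_order_relaxed); }
private:
    std::atomic<bool> locked{false};
};

// Small usage example.
inline void annotatedMutexDemo() {
    AnnotatedMutex m;
    m.lock();
    assert(m.isLocked());
    m.unlock();
    assert(!m.isLocked());
}
```

Outside valgrind the client requests are a short no-op instruction sequence, so
the inline path stays inline either way.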

Regarding Transactional memory:
 *) Very interesting
 *) Notice the end of section 8.2.1: "Improper use of hints will not cause
    functional bugs though it may expose latent bugs already in the code."
    In other words, we can use XACQUIRE and XRELEASE in inline code without
    any problem and without any binary compatibility issue.
 *) The other transactional model does not really apply, but even if it did,
    we could still use different values for locked and unlocked, so that the
    previously inlined code falls back to the non-inline code.
 *) Debugging tools (valgrind) could also detect the XACQUIRE and XRELEASE
    prefixes.
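
As a rough sketch of the point about hints, the lock below asks the compiler
to emit the XACQUIRE and XRELEASE prefixes on the lock and unlock instructions.
It assumes GCC-style __atomic builtins and GCC's x86-specific
__ATOMIC_HLE_ACQUIRE / __ATOMIC_HLE_RELEASE memory-model flags. Where those
flags are not defined, it degrades to an ordinary spinlock. ElidedSpinLock,
elisionDemo and the SKETCH_* macros are hypothetical names for illustration.

```cpp
#include <cassert>

// GCC predefines the HLE flags on x86 targets. Elsewhere, use no hint at all
// (an assumption of this sketch); the code is then a plain spinlock.
#ifdef __ATOMIC_HLE_ACQUIRE
#  define SKETCH_HLE_ACQUIRE __ATOMIC_HLE_ACQUIRE
#  define SKETCH_HLE_RELEASE __ATOMIC_HLE_RELEASE
#else
#  define SKETCH_HLE_ACQUIRE 0
#  define SKETCH_HLE_RELEASE 0
#endif

// Hypothetical lock; not the real QMutex.
struct ElidedSpinLock {
    int state = 0;  // 0 = unlocked, 1 = locked

    // Atomic exchange, with the XACQUIRE hint where available.
    bool tryLock() {
        return __atomic_exchange_n(&state, 1,
                   __ATOMIC_ACQUIRE | SKETCH_HLE_ACQUIRE) == 0;
    }
    void lock() {
        while (!tryLock()) {
            // Wait with plain loads until the lock looks free, then retry.
            while (__atomic_load_n(&state, __ATOMIC_RELAXED) != 0) { }
        }
    }
    // Atomic store, with the XRELEASE hint where available.
    void unlock() {
        __atomic_store_n(&state, 0, __ATOMIC_RELEASE | SKETCH_HLE_RELEASE);
    }
};

// Small usage example: the observable behaviour is that of a normal lock.
inline void elisionDemo() {
    ElidedSpinLock l;
    l.lock();
    assert(!l.tryLock());  // already held
    l.unlock();
    assert(l.tryLock());   // free again
    l.unlock();
}
```

On hardware that does not implement the hints, the prefixes are ignored and
the instructions behave as a normal exchange and store, which is why the
inline code would not need to change.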

Regarding increasing the size:
 *) I think it is valuable to have a small memory overhead per mutex.
 *) I think recursive mutexes don't deserve improvements to the detriment of
    normal ones.
 *) I think there would be no improvement on Windows.

[1] http://paste.kde.org/482474/

-- 
Olivier

Woboq - Qt services and support - http://woboq.com

On Saturday 19 May 2012 10:26:03 Thiago Macieira wrote:
> > > Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU
> > > resources. It will consume front-end decoding, one ALU port, the
> > > retirement, not to mention increased code size. There's no free lunch.
> > 
> > Slower than a function call (which will likely spill registers)? I don't
> > believe so.
> 
> No, not slower. But there's a cost anyway, including a cost that isn't
> measured in CPU cycles -- the cost of opportunity of debugging.
> 
> > (moreover, we are talking about debug builds, right?)
> 
> No. We're talking about release builds too.
> 
> > > > That is not relevant.
> > > > Transactional memory requires a different set of primitives. QMutex
> > > > has nothing to do with transactional memory.
> > > 
> > > Then you didn't read the manual. I'm going to ignore what you said until
> > > you read the manual because transactional memory has everything to do
> > > with QMutex.
> > > 
> > > Please, either believe me or read the manual.
> > 
> > Do you have a link to that manual?
> 
> http://software.intel.com/file/41604 (see chapter 8)
> 
> See also
> http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
> 
> > Transactional memory is about detecting conflicts between transactions
> > and rolling back.
> > Mutexes are about locking until the resource is free.
> 
> And Intel TSX allows us to do both: it lets us remove the lock and execute
> the same code optimistically, hoping there are no conflicts. If there are,
> it will "roll back" and take the lock; if there are none, the transaction
> is committed atomically when you unlock.
> 
> Our QMutex is optimised for the non-contended case. TSX will allow us to
> optimise even the contended case where there was no data conflict.
> 
> > Transactional memory could be used to simplify the code that allocates the
> > QMutexPrivate.
> > 
> > But I doubt the hypothetical gain is even worth the function call.
> > 
> > Because remember, the uncontended case is much more critical than the
> > contended case. (There is no point in saving a dozen cycles if we are
> > going to pay several hundred anyway in the context switch that follows.)
> > 
> > But a fast uncontended case is critical because it allows the use of
> > mutexes in code that might or might not be used with threads. Example:
> > QObject. We need to lock a mutex while emitting a signal because it may
> > be used from multiple threads. But we don't want to pay too much when it
> > is used in a single thread (the common case).
> 
> That's a very good example of where we want QMutex to be transactional: we
> need the mutex because we need to be sure that no other thread is modifying
> the connection list while the signal activation is reading the lists.
> However, we know that it's an extremely rare case for that to happen. If we
> have TSX in the signal emission path, then two threads could read from the
> signal list simultaneously.
> 
> In other words, with TSX, QMutex becomes a QReadWriteLock with no extra data
> cost.
> 
> > > My suggestion was to avoid the QMutexPrivate allocation on Mac and
> > > Windows, since they require just a bit of initialisation. Now, given
> > > what you said about semaphore_t, we may not be able to do it on the
> > > Mac. But we can try to apply the same optimisation for Windows -- the
> > > initialisation is a call to CreateEvent.
> > 
> > But CreateEvent probably still allocates memory behind the scenes.
> 
> True, but will you let me at least try?
> 
> If you have data saying that allocating a Windows event with CreateEvent for
> every single QMutex has a higher cost than the free list, please add that
> comment to qmutex_win.cpp.
> 
> If you don't have such data, then I will try this.
> 
> > > I disagree. If we shoot ourselves in the foot by not being able to
> > > support TSX and valgrind in Qt 5, we've lost a lot. That's why I
> > > propose de-inlining for 5.0, so we have enough time to investigate
> > > those potential drawbacks.
> > 
> > I think the inline is not a problem for valgrind.
> 
> Then please add *now* the necessary valgrind macros and prove to me that it
> works.
> 
> Note that proving that it works is a necessary condition for keeping them
> inline. It's not a sufficient condition.
> 
> > And I don't think we can gain much with TSX.
> 
> I think differently. Since neither of us has Haswell hardware yet to try,
> then we simply have two opinions with no way to be sure who's right.
> 
> Except that, if I'm right, then we may have shot ourselves in the foot. I'm
> simply asking that we point the gun away from our feet right now, until we
> find out more.
> 
> > However, is it not shooting ourselves in the foot not to inline it? We
> > can hardly inline it later in a binary compatible way.
> 
> Actually, I think I wasn't clear here. I want to leave the lock() and
> unlock() functions inline. But I'd like to change them so that they call
> lockInternal() and unlockInternal() directly in 5.0, without the
> compare-and-exchanges.
> 
> If in 5.1 we decide that it was fine to inline the locking and unlocking
> code, then we can simply go back. The internal functions don't change
> because of that: they need to deal with the possibility of the mutex having
> been unlocked behind their backs anyway.
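
To make the two options at the end of the quoted message concrete, here is a
minimal sketch with hypothetical names (ForwardingMutex, lockFast, unlockFast
and forwardingDemo are not Qt API). lock() and unlock() are the inline wrappers
that only forward to the out-of-line functions. lockFast() and unlockFast()
show the alternative with an inline compare-and-exchange. The spin loop stands
in for the real blocking slow path.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in; not the real QMutex.
class ForwardingMutex {
public:
    // Option A: inline wrappers that only forward to out-of-line code.
    inline void lock()   { lockInternal(); }
    inline void unlock() { unlockInternal(); }

    // Option B: inline fast path, out-of-line code only on failure.
    inline void lockFast() {
        int expected = 0;
        if (!state.compare_exchange_strong(expected, 1,
                                           std::memory_order_acquire))
            lockInternal();
    }
    inline void unlockFast() {
        int expected = 1;
        if (!state.compare_exchange_strong(expected, 0,
                                           std::memory_order_release))
            unlockInternal();
    }

    bool isLocked() const { return state.load(std::memory_order_relaxed) != 0; }

private:
    void lockInternal();    // defined out of line below
    void unlockInternal();
    std::atomic<int> state{0};  // 0 = unlocked, 1 = locked
};

// The internal functions re-check the state themselves, so they are correct
// whether or not the caller attempted an inline compare-and-exchange first.
void ForwardingMutex::lockInternal() {
    int expected = 0;
    while (!state.compare_exchange_weak(expected, 1,
                                        std::memory_order_acquire)) {
        expected = 0;  // a real implementation would block here, not spin
    }
}

void ForwardingMutex::unlockInternal() {
    state.store(0, std::memory_order_release);  // real code would wake waiters
}

// Small usage example: the two styles can be mixed on one mutex.
inline void forwardingDemo() {
    ForwardingMutex m;
    m.lock();
    m.unlockFast();
    m.lockFast();
    m.unlock();
    assert(!m.isLocked());
}
```

Because the out-of-line functions do not rely on what the inline wrapper did,
switching between the two styles later does not require changing them.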