[Development] Mutex future directions

Olivier Goffart olivier at woboq.com
Fri May 18 22:45:22 CEST 2012


On Friday 18 May 2012 20:34:51 Thiago Macieira wrote:
> Hello
> 
> I've just completed a review of the QMutex & family internals, as well as
> the Intel optimisation manuals about threading, data sharing, locking and
> the Transactional Memory extensions.
> 
> Short story: I recommend de-inlining QMutex code for Qt 5.0. We can
> re-inline later. I also recommend increasing the QBasicMutex / QMutex size
> to accommodate more data without the pointer indirection. Further, some
> work needs to be done to support Valgrind. Details below.

Inlining has been shown to improve performance a lot for performance-critical
operations where two function calls matter. It improved the performance of signal
emission, for example.
Now, if you can prove to me that the cost of the function call is negligible
compared to the atomic operation, maybe I can change my mind. But the last time I
checked, inlining was useful.
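To make the trade-off concrete, here is a minimal sketch of the pattern under discussion: an inlined fast path that costs a single atomic operation in the uncontended case, falling back to an out-of-line slow path under contention. This is illustrative only; the names and structure are made up and are not Qt's actual QMutex internals.

```cpp
#include <atomic>

// Illustrative sketch, not Qt code: the fast path is meant to be inlined
// at the call site, so an uncontended lock() costs one atomic CAS and no
// function call. Only the contended case pays for the out-of-line call.
class BasicLock {
public:
    void lock() {
        bool expected = false;
        // Fast path: one compare-and-swap in the uncontended case.
        if (!locked.compare_exchange_strong(expected, true,
                                            std::memory_order_acquire))
            lockSlowPath();   // contended: out-of-line, cost is dominated
                              // by waiting, not by the extra call
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }

private:
    void lockSlowPath() {
        bool expected = false;
        // A real implementation would block (futex, semaphore, ...);
        // this sketch just spins until the CAS succeeds.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire))
            expected = false; // CAS failure overwrote 'expected'; reset it
    }
    std::atomic<bool> locked{false};
};
```

De-inlining lock() would turn the single CAS of the uncontended case into a call plus a CAS, which is exactly the overhead the paragraph above argues is measurable in hot paths such as signal emission.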


As for the size, I think it is really nice to have a light QMutex. That way,
you can have more QMutexes, and therefore more granularity.
None of your arguments convinces me that it is a good idea to increase the size.
Assigning a QMutexPrivate to a contended QMutex should be fast enough.


> Long story follows.
> 
> Current state:
> 
> QBasicMutex is an internal POD class that offers non-recursive locking. It's
> incredibly efficient for Linux, where the single pointer-sized member
> variable is enough to execute the futex operations, in all cases. For other
> platforms, we get the same efficiency in non-contended cases, but incur a
> non-negligible performance penalty when contention happens. 

Which non-negligible performance penalty are you talking about here?

> QBasicMutex's documentation also has a note saying that timed locks may not
> work and may cause memory leaks.

It works.  But it might leak the QMutexPrivate if the QMutex destructor is not
run.  This is usually not a problem because QBasicMutex is meant to be a
global object. (Also, who uses tryLock with a timeout?)
 
> QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and
> provides support for recursive mutexes. A recursive QMutex is simply a
> QMutex with an owner ID and a recursion counter.

Note that recursive mutexes should be advised against.

(google for recursive mutex:  
http://www.zaval.org/resources/library/butenhof1.html )


> QWaitCondition, QSemaphore, and QReadWriteLock are not optimised at all[...]

Yes, feel free to optimize, but this is unrelated to QMutex.

> Valgrind's helgrind and DRD tools can currently operate on Qt locks by
> preloading a library and hijacking QMutex functions. I have tested
> helgrinding Qt4 applications in the past. I do not see a library version
> number in the symbols, so it's possible that helgrind would work unmodified
> with Qt 5 if we were to de-inline the functions. However, we should
> probably approach the Valgrind community to make sure that the Qt 5 mutexes
> work.

Yes, there are probably some no-ops we could introduce in debug builds to hint
to those tools.
But that can work with the inlined code as well.

> The Intel optimisation manual says that data sharing is most efficient when
> each thread or core is operating on a disjoint set of cachelines. That is,
> if you have two threads running, they are most efficient when there are no
> writes by both threads to the same cacheline (or cache sector of 128 bytes
> on Pentium 4).
> Shared reading is fine. While this is the Intel optimisation manual, the
> recommendation is probably a good rule of thumb for any architecture.
> 
> There some other optimisation hints about using pipelined locks and about
> cache aliasing at 64 kB and 1 MB, but those are higher-level problems than
> we can solve at the lock level.
> 
> The TSX manual says that transactional memory contention happens at the
> cacheline level. That is, if the transaction reads from a cacheline that is
> modified outside the transaction, or if the transaction writes to a
> cacheline that is read from or written to outside the transaction. I do not
> believe this to be more of a problem than the above optimisation guideline,
> which is something for the higher-level organisation: do not put two
> independent lock variables in the same cacheline.
> 
> There are two types of instructions to start and finish a transaction. One
> pair is backwards compatible with existing processors and could be inserted
> to every single mutex lock and unlock, even in inline code (which might
> serve as a hint to valgrind, for example). The other pair requires checking
> the processor CPUID first.
> 
> However, there are no processors in the market with transactional memory
> support and I don't have access to a prototype (yet, anyway), so at this
> point we simply have no idea whether enabling transactions for all mutex
> locks is a good idea. If enabling for them all isn't a good idea, the code
> for being adaptive cannot be inlined and we're further bound by the
> current inline code in QMutex.

That is not relevant.
Transactional memory requires a different set of primitives. QMutex has 
nothing to do with transactional memory.

> Recommendations (priority):
> 
> (P0) de-inline QBasicMutex locking functions until we solve some or all of
> the below problems

What problems exactly?

> (P1) expand the size of QBasicMutex and QMutex to accommodate more data, for
> Mac and Windows. On Windows, the extra data required is one pointer (a
> HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we
> might need a bit more space. At first glance, all three implementations
> would work fine with a lock state and the wait structure -- with the futex
> implementation doing both in one single integer.

Those HANDLEs or semaphore_ts have much higher costs. They need to be created
and destroyed, and they allocate some private data as well.

semaphore_t is a good example: we tried to use semaphores on Mac in Qt 4.8,
but it had to be reverted back to pthread. The reason was a huge performance
regression in constructing QMutex (creating the semaphore was too slow).

In Qt 5, only a few semaphore_ts are created, and they are shared between
mutexes.
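A rough sketch of that pooling idea follows. The names and structure here are illustrative, not Qt's actual implementation: expensive platform wait objects are created lazily and recycled through a free list, so N mutexes do not require N semaphores.

```cpp
#include <mutex>
#include <vector>

// Illustrative sketch: a stand-in for a platform wait object
// (semaphore_t on Mac, a HANDLE on Windows) that is costly to create.
struct WaitStructure {
    int id; // just to make instances distinguishable in this sketch
};

// Contended mutexes borrow a WaitStructure from the pool and return it
// when contention ends; OS objects are only created on first demand.
class WaitStructurePool {
public:
    WaitStructure *acquire() {
        std::lock_guard<std::mutex> g(poolLock);
        if (!freeList.empty()) {
            WaitStructure *w = freeList.back();
            freeList.pop_back();
            return w;                        // reuse: no OS object creation
        }
        return new WaitStructure{nextId++};  // create only when pool is dry
    }
    void release(WaitStructure *w) {
        std::lock_guard<std::mutex> g(poolLock);
        freeList.push_back(w);               // recycle for the next mutex
    }
    int created() const { return nextId; }   // how many OS objects exist

private:
    std::mutex poolLock;
    std::vector<WaitStructure *> freeList;
    int nextId = 0;
};
```

With this scheme, constructing a mutex allocates nothing; the expensive object is only touched on contention, which is the property the Qt 4.8 Mac regression showed to matter.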

> (P1) approach the Valgrind developers to ensure that helgrind and DRD work
> with Qt 5
>
> (P2) optimise the Mac and Windows implementations to avoid the need for
> allocating a dynamic d pointer in case of contention. In fact, remove the
> need for dynamic d-pointer allocation altogether: Mac, Windows and Linux
> should never do it, while the generic Unix implementation should do it all
> the time in the constructor

The allocation of the d-pointer is not that dynamic.
It is already quite optimized in Qt 5.

> (P2) investigate whether the recursive QMutex can benefit from the expanded
> size of QMutex too

We should not care that much about recursive mutexes.

> (P2) analyse QReadWriteLock and see if expanding the size of the structure
> like QMutex would be beneficial

I had thought about implementing QReadWriteLock on top of QBasicMutex.  I
believe it is possible to add a QBasicMutex::lockForRead and increment the
d_ptr for every reader.  Then, if the value is bigger than, say, 1024, consider
it a pointer.
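A minimal sketch of that encoding, with illustrative names and the contended (pointer) path omitted: a single pointer-sized word holds a reader count while it stays below the threshold, and anything at or above the threshold would be interpreted as a pointer to a private structure.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch of the encoding above (not Qt code): values in
// [0, MaxReaders) are a reader count; values >= MaxReaders would be
// treated as a pointer to a d_ptr for the contended/writer case.
constexpr std::uintptr_t MaxReaders = 1024;

class TinyReadLock {
public:
    bool tryLockForRead() {
        std::uintptr_t v = word.load(std::memory_order_acquire);
        while (v < MaxReaders) {
            // Still a plain counter: try to add one more reader.
            if (word.compare_exchange_weak(v, v + 1,
                                           std::memory_order_acquire))
                return true;
            // CAS failure reloaded v; loop re-checks the threshold.
        }
        return false;  // looks like a pointer: contended path (omitted)
    }
    void unlockForRead() {
        word.fetch_sub(1, std::memory_order_release);
    }
    std::uintptr_t readers() const { return word.load(); }

private:
    std::atomic<std::uintptr_t> word{0};
};
```

The appeal is the same as QBasicMutex's: the whole lock stays one pointer-sized word, and readers pay a single atomic increment on the uncontended path.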

> (P3) investigate TSX support for QMutex, whether by using HLE or RTM, if
> unconditional or adaptive -- make sure that QMutex unlocking by way of
> QWaitCondition waiting has the correct semantics
> 
> (P3) investigate TSX support for QReadWriteLock
> 
> (P4) optimise the implementations (at least the Linux one) by reading the
> assembly
> 
> If, at the completion of the above tasks, we conclude that inlining QMutex
> locking would be beneficial, with minimal side-effects, we can re-inline it.
> The same applies to QReadWriteLock.


I believe there are more important priorities right now than touching QMutex,
which already had its face-lift for Qt 5.

-- 
Olivier

Woboq - Qt services and support - http://woboq.com


