[Development] Mutex future directions

Fri May 18 20:34:51 CEST 2012

Hello

I've just completed a review of the QMutex & family internals, as well as the 
Intel optimisation manuals about threading, data sharing, locking and the 
Transactional Memory extensions.

Short story: I recommend de-inlining QMutex code for Qt 5.0. We can re-inline 
later. I also recommend increasing the QBasicMutex / QMutex size to 
accommodate more data without the pointer indirection. Further, some work 
needs to be done to support Valgrind. Details below.

Long story follows.

Current state:

QBasicMutex is an internal POD class that offers non-recursive locking. It's 
incredibly efficient for Linux, where the single pointer-sized member variable 
is enough to execute the futex operations, in all cases. For other platforms, 
we get the same efficiency in non-contended cases, but incur a non-negligible 
performance penalty when contention happens. QBasicMutex's documentation also 
has a note saying that timed locks may not work and may cause memory leaks.
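
To illustrate how a single word can carry both the lock state and the 
contention information, here is a minimal sketch. It is not the QBasicMutex 
code: the names WordLock, waitWhileEqual and wakeOne are made up for this 
example, and the two helpers only stand in for the futex wait and wake calls 
(they poll, so that the sketch stays portable).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for futex wait/wake. A real Linux implementation
// would sleep in the kernel; this portable version just polls.
inline void waitWhileEqual(const std::atomic<int> &word, int value)
{
    while (word.load(std::memory_order_relaxed) == value)
        std::this_thread::yield();
}
inline void wakeOne(std::atomic<int> &) { /* nothing to do when polling */ }

// Illustrative only: the whole lock is one integer.
// 0 = unlocked, 1 = locked, 2 = locked and someone may be waiting
class WordLock
{
    std::atomic<int> state{0};
public:
    bool tryLock()
    {
        int expected = 0;
        return state.compare_exchange_strong(expected, 1,
                                             std::memory_order_acquire);
    }
    void lock()
    {
        if (tryLock())
            return;                     // non-contended fast path
        // Contended path: advertise a waiter, then wait until the
        // exchange observes "unlocked"
        while (state.exchange(2, std::memory_order_acquire) != 0)
            waitWhileEqual(state, 2);
    }
    void unlock()
    {
        const int previous = state.exchange(0, std::memory_order_release);
        assert(previous != 0);          // unlocking an unlocked lock is a bug
        // Only pay for a wake-up if a waiter advertised itself
        if (previous == 2)
            wakeOne(state);
    }
};
```

The non-contended lock and unlock each cost one atomic operation, which is 
the property described above. The wake-up call is only reached when a waiter 
has advertised itself.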

QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and provides 
support for recursive mutexes. A recursive QMutex is simply a QMutex with an 
owner ID and a recursion counter.
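
The owner ID plus recursion counter idea can be sketched as follows. This is 
only an illustration using the standard library, not Qt's implementation; 
RecursiveLock and depth() are names invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Illustrative only: a recursive lock built from a plain mutex plus
// an owner ID and a recursion counter.
class RecursiveLock
{
    std::mutex mutex;                                       // non-recursive lock
    std::atomic<std::thread::id> owner{std::thread::id()};  // default id = no owner
    int count = 0;                                          // touched by the owner only
public:
    void lock()
    {
        const std::thread::id self = std::this_thread::get_id();
        if (owner.load(std::memory_order_relaxed) == self) {
            ++count;                    // re-entry by the current owner
            return;
        }
        mutex.lock();                   // first acquisition may block
        owner.store(self, std::memory_order_relaxed);
        count = 1;
    }
    void unlock()
    {
        // Only the owning thread may unlock
        assert(owner.load(std::memory_order_relaxed) == std::this_thread::get_id());
        if (--count > 0)
            return;                     // still held by an outer lock() call
        owner.store(std::thread::id(), std::memory_order_relaxed);
        mutex.unlock();
    }
    int depth() const { return count; } // for demonstration; owner thread only
};
```

A thread that already owns the lock never touches the underlying mutex again, 
it only bumps the counter. The underlying mutex is released when the counter 
drops back to zero.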

QWaitCondition, QSemaphore, and QReadWriteLock are not optimised at all. 
QWaitCondition is implemented by using pthread on Unix and an event queue on 
Windows. QSemaphore further builds upon them by using one QMutex and one 
QWaitCondition. QReadWriteLock has one QMutex, two QWaitConditions and some 
other private data.
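
As a rough picture of the "one mutex and one wait condition" layering, a 
semaphore can be sketched like this. It uses std::mutex and 
std::condition_variable in place of the Qt classes, and SimpleSemaphore is a 
name made up for the example, so take it as an assumption-laden sketch rather 
than the QSemaphore source.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Illustrative only: a counting semaphore layered on one mutex
// and one condition variable.
class SimpleSemaphore
{
    std::mutex mutex;
    std::condition_variable cond;
    int count;
public:
    explicit SimpleSemaphore(int initial = 0) : count(initial) {}

    void acquire(int n = 1)
    {
        assert(n >= 0);
        std::unique_lock<std::mutex> locker(mutex);
        // Sleep until enough resources are available
        cond.wait(locker, [&] { return count >= n; });
        count -= n;
    }
    bool tryAcquire(int n = 1)
    {
        assert(n >= 0);
        std::lock_guard<std::mutex> locker(mutex);
        if (count < n)
            return false;               // would have to block
        count -= n;
        return true;
    }
    void release(int n = 1)
    {
        assert(n >= 0);
        std::lock_guard<std::mutex> locker(mutex);
        count += n;
        cond.notify_all();              // waiters re-check their own n
    }
    int available()
    {
        std::lock_guard<std::mutex> locker(mutex);
        return count;
    }
};
```

Every operation, even a non-contended one, takes the mutex, which is why a 
structure like this inherits whatever cost the underlying mutex has.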

Valgrind's helgrind and DRD tools can currently operate on Qt locks by 
preloading a library and hijacking QMutex functions. I have tested helgrinding 
Qt4 applications in the past. I do not see a library version number in the 
symbols, so it's possible that helgrind would work unmodified with Qt 5 if we 
were to de-inline the functions. However, we should probably approach the 
Valgrind community to make sure that the Qt 5 mutexes work.

The Intel optimisation manual says that data sharing is most efficient when each 
thread or core is operating on a disjoint set of cachelines. That is, if you 
have two threads running, they are most efficient when there are no writes by 
both threads to the same cacheline (or cache sector of 128 bytes on Pentium 4). Shared reading 
is fine. While this is the Intel optimisation manual, the recommendation is 
probably a good rule of thumb for any architecture.

There are some other optimisation hints about using pipelined locks and about 
cache aliasing at 64 kB and 1 MB, but those are higher-level problems than we 
can solve at the lock level.

The TSX manual says that transactional memory contention happens at the 
cacheline level. That is, a conflict occurs if the transaction reads from a 
cacheline that is modified outside the transaction, or if the transaction 
writes to a cacheline that is read from or written to outside the 
transaction. I do not believe this 
to be more of a problem than the above optimisation guideline, which is 
something for the higher-level organisation: do not put two independent lock 
variables in the same cacheline.
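
A minimal sketch of keeping two independent lock variables on separate 
cachelines follows. The 64-byte line size is an assumption (it is not queried 
from the hardware), and all the names are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cachelines. Use 128 to cover Pentium 4 sectors.
constexpr std::size_t AssumedCacheLine = 64;

// Each lock variable gets a cacheline of its own
struct alignas(AssumedCacheLine) PaddedLock
{
    std::atomic<int> state{0};
};
static_assert(sizeof(PaddedLock) == AssumedCacheLine,
              "alignment pads the struct to a full line");

// Two independent locks that can be written by different threads
// without touching the same line
struct TwoLocks
{
    PaddedLock first;
    PaddedLock second;
};

// For contrast: two lock words deliberately packed into one line
struct alignas(AssumedCacheLine) SharedLine
{
    std::atomic<int> first{0};
    std::atomic<int> second{0};
};

// Helper for checking the layout under the assumed line size
inline bool sameCacheLine(const void *a, const void *b)
{
    return reinterpret_cast<std::uintptr_t>(a) / AssumedCacheLine
        == reinterpret_cast<std::uintptr_t>(b) / AssumedCacheLine;
}
```

The cost is memory: each padded lock occupies a full line, so this is a 
decision for the code that lays out the locks, not for the lock class itself.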

There are two types of instructions to start and finish a transaction. One pair 
is backwards compatible with existing processors and could be inserted into 
every single mutex lock and unlock, even in inline code (which might serve as 
a hint to Valgrind, for example). The other pair requires checking the 
processor CPUID first.

However, there are no processors on the market with transactional memory 
support and I don't have access to a prototype (yet, anyway), so at this point 
we simply have no idea whether enabling transactions for all mutex locks is a 
good idea. If enabling them for all locks turns out to be a bad idea, the code 
for being adaptive cannot be inlined, and we're further bound by the current 
inline code in QMutex.

Recommendations (priority):

(P0) de-inline QBasicMutex locking functions until we solve some or all of the 
below problems

(P1) expand the size of QBasicMutex and QMutex to accommodate more data, for 
Mac and Windows. On Windows, the extra data required is one pointer (a 
HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we 
might need a bit more space. At first glance, all three implementations would 
work fine with a lock state and the wait structure -- with the futex 
implementation doing both in one single integer.

(P1) approach the Valgrind developers to ensure that helgrind and DRD work 
with Qt 5

(P2) optimise the Mac and Windows implementations to avoid the need for 
allocating a dynamic d pointer in case of contention. In fact, remove the need 
for dynamic d-pointer allocation altogether: Mac, Windows and Linux should 
never do it, while the generic Unix implementation should do it all the time 
in the constructor

(P2) investigate whether the recursive QMutex can benefit from the expanded 
size of QMutex too

(P2) analyse QReadWriteLock and see if expanding the size of the structure 
like QMutex would be beneficial

(P3) investigate TSX support for QMutex, whether by using HLE or RTM, and 
whether unconditional or adaptive -- make sure that QMutex unlocking by way of 
QWaitCondition waiting has the correct semantics

(P3) investigate TSX support for QReadWriteLock

(P4) optimise the implementations (at least the Linux one) by reading the 
assembly

If, at the completion of the above tasks, we conclude that inlining QMutex 
locking would be beneficial, with minimal side-effects, we can re-inline it. The 
same applies to QReadWriteLock.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden