[Development] Mutex future directions
Thiago Macieira
thiago.macieira at intel.com
Fri May 18 20:34:51 CEST 2012
Hello
I've just completed a review of the QMutex & family internals, as well as the
Intel optimisation manuals about threading, data sharing, locking and the
Transactional Memory extensions.
Short story: I recommend de-inlining QMutex code for Qt 5.0. We can re-inline
later. I also recommend increasing the QBasicMutex / QMutex size to
accommodate more data without the pointer indirection. Further, some work
needs to be done to support Valgrind. Details below.
Long story follows.
Current state:
QBasicMutex is an internal POD class that offers non-recursive locking. It's
incredibly efficient for Linux, where the single pointer-sized member variable
is enough to execute the futex operations, in all cases. For other platforms,
we get the same efficiency in non-contended cases, but incur a non-negligible
performance penalty when contention happens. QBasicMutex's documentation also
has a note saying that timed locks may not work and may cause memory leaks.
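To make the "single pointer-sized member variable" point concrete, here is a minimal sketch of a lock whose whole state fits in one word. It is not the QBasicMutex code: the class name OneWordMutexSketch is made up for illustration, and the contended path simply yields, where a real implementation would sleep in the kernel (a futex wait on Linux, for example).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical illustration only; not the QBasicMutex implementation.
class OneWordMutexSketch {
public:
    // Uncontended fast path: a single compare-and-swap from 0 (free) to 1 (locked).
    bool tryLock() {
        std::uintptr_t expected = 0;
        return word.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed);
    }

    void lock() {
        // Assumption: yielding stands in for a proper kernel-level wait.
        while (!tryLock())
            std::this_thread::yield();
    }

    void unlock() {
        word.store(0, std::memory_order_release);
    }

private:
    std::atomic<std::uintptr_t> word{0};
};

// Holds on mainstream platforms: the whole lock is one pointer-sized word.
static_assert(sizeof(OneWordMutexSketch) == sizeof(void *),
              "lock state should fit in one pointer-sized word");
```

The uncontended lock and unlock never leave user space, which is the case where all platforms perform equally well. The differences described above only show up in what replaces the yield loop.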
QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and provides
support for recursive mutexes. A recursive QMutex is simply a QMutex with an
owner ID and a recursion counter.
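As a rough illustration of "a mutex with an owner ID and a recursion counter", the sketch below builds a recursive lock on top of std::mutex. The class name RecursiveMutexSketch and the depth() accessor are invented for this example and it is not how QMutex is actually written.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical illustration only; not the QMutex implementation.
class RecursiveMutexSketch {
public:
    void lock() {
        const std::thread::id self = std::this_thread::get_id();
        if (owner.load(std::memory_order_relaxed) == self) {
            ++count;                 // already ours: only bump the counter
            return;
        }
        inner.lock();                // first acquisition by this thread
        owner.store(self, std::memory_order_relaxed);
        count = 1;
    }

    void unlock() {
        // Only the owning thread may unlock.
        assert(owner.load(std::memory_order_relaxed) == std::this_thread::get_id());
        if (--count > 0)
            return;                  // still held by an outer lock() call
        owner.store(std::thread::id(), std::memory_order_relaxed);
        inner.unlock();
    }

    // Only meaningful when called by the owning thread (or when unlocked).
    unsigned depth() const { return count; }

private:
    std::mutex inner;                                       // the non-recursive lock
    std::atomic<std::thread::id> owner{std::thread::id()};  // owner ID
    unsigned count = 0;                                     // recursion counter
};
```

A thread can only ever read its own ID from the owner field if it stored it there itself, which is why the relaxed load on the fast path is enough in this sketch.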
QWaitCondition, QSemaphore, and QReadWriteLock are not optimised at all.
QWaitCondition is implemented by using pthread on Unix and an event queue on
Windows. QSemaphore further builds upon them by using one QMutex and one
QWaitCondition. QReadWriteLock has one QMutex, two QWaitConditions and some
other private data.
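The "one mutex plus one wait condition" construction for a semaphore looks roughly like the following. This is a sketch using the standard library equivalents (std::mutex and std::condition_variable) rather than the Qt classes, and the name SemaphoreSketch is made up for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical illustration only; not the QSemaphore implementation.
class SemaphoreSketch {
public:
    explicit SemaphoreSketch(int initial = 0) : avail(initial) {}

    // Blocks until n resources are available, then takes them.
    void acquire(int n = 1) {
        std::unique_lock<std::mutex> guard(m);
        cv.wait(guard, [&] { return avail >= n; });
        avail -= n;
    }

    // Takes n resources if available right now; never blocks on the count.
    bool tryAcquire(int n = 1) {
        std::lock_guard<std::mutex> guard(m);
        if (avail < n)
            return false;
        avail -= n;
        return true;
    }

    // Returns n resources and wakes all waiters so they can re-check.
    void release(int n = 1) {
        assert(n >= 0);
        {
            std::lock_guard<std::mutex> guard(m);
            avail += n;
        }
        cv.notify_all();
    }

    int available() {
        std::lock_guard<std::mutex> guard(m);
        return avail;
    }

private:
    std::mutex m;                 // protects avail
    std::condition_variable cv;   // signalled when avail grows
    int avail;
};
```

Every operation, even an uncontended one, goes through the mutex, which is part of why this layering counts as "not optimised at all".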
Valgrind's helgrind and DRD tools can currently operate on Qt locks by
preloading a library and hijacking QMutex functions. I have tested helgrinding
Qt4 applications in the past. I do not see a library version number in the
symbols, so it's possible that helgrind would work unmodified with Qt 5 if we
were to de-inline the functions. However, we should probably approach the
Valgrind community to make sure that the Qt 5 mutexes work.
The Intel optimisation manual says that data sharing is most efficient when each
thread or core is operating on a disjoint set of cachelines. That is, if you
have two threads running, they are most efficient when there are no writes by
both threads to the same cacheline (or cache sector of 128 bytes on Pentium 4).
Shared reading
is fine. While this is the Intel optimisation manual, the recommendation is
probably a good rule of thumb for any architecture.
There are some other optimisation hints about using pipelined locks and about
cache aliasing at 64 kB and 1 MB, but those are higher-level problems than we
can solve at the lock level.
The TSX manual says that transactional memory contention happens at the
cacheline level. That is, a conflict happens if the transaction reads from a
cacheline that is modified outside the transaction, or if it writes to a
cacheline that is read from or written to outside the transaction. I do not
believe this
to be more of a problem than the above optimisation guideline, which is
something for the higher-level organisation: do not put two independent lock
variables in the same cacheline.
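For the "do not put two independent lock variables in the same cacheline" guideline, a sketch of what the higher-level code could do is below. The 64-byte line size is an assumption (it matches current x86 parts but is not universal), and the names PaddedLock, TwoLocks and sameCacheLine are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cachelines. Adjust for the target architecture.
constexpr std::size_t kAssumedCacheLine = 64;

// A lock word that occupies a cacheline of its own.
struct alignas(kAssumedCacheLine) PaddedLock {
    std::atomic<int> state{0};
};

// Two independent locks that never share a cacheline.
struct TwoLocks {
    PaddedLock a;
    PaddedLock b;
};

static_assert(sizeof(PaddedLock) == kAssumedCacheLine,
              "each lock should fill exactly one assumed cacheline");

// Helper: true if both addresses fall in the same assumed cacheline.
inline bool sameCacheLine(const void *p, const void *q) {
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedCacheLine
        == reinterpret_cast<std::uintptr_t>(q) / kAssumedCacheLine;
}
```

The cost is obvious: each lock grows from a few bytes to a full line, so this is a decision for the code that owns the locks, not for the lock class itself.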
There are two types of instructions to start and finish a transaction. One pair
is backwards compatible with existing processors and could be inserted into
every single mutex lock and unlock, even in inline code (which might serve as
a hint to Valgrind, for example). The other pair requires checking the
processor CPUID first.
However, there are no processors on the market with transactional memory
support and I don't have access to a prototype (yet, anyway), so at this point
we simply have no idea whether enabling transactions for all mutex locks is a
good idea. If enabling them for all locks isn't a good idea, the code for being
adaptive cannot be inlined and we're further bound by the current inline
code in QMutex.
Recommendations (priority):
(P0) de-inline QBasicMutex locking functions until we solve some or all of the
below problems
(P1) expand the size of QBasicMutex and QMutex to accommodate more data, for
Mac and Windows. On Windows, the extra data required is one pointer (a
HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we
might need a bit more space. At first glance, all three implementations would
work fine with a lock state and the wait structure -- with the futex
implementation doing both in one single integer.
(P1) approach the Valgrind developers to ensure that helgrind and DRD work
with Qt 5
(P2) optimise the Mac and Windows implementations to avoid the need for
allocating a dynamic d pointer in case of contention. In fact, remove the need
for dynamic d-pointer allocation altogether: Mac, Windows and Linux should
never do it, while the generic Unix implementation should do it all the time
in the constructor
(P2) investigate whether the recursive QMutex can benefit from the expanded
size of QMutex too
(P2) analyse QReadWriteLock and see if expanding the size of the structure
like QMutex would be beneficial
(P3) investigate TSX support for QMutex, whether by using HLE or RTM, and
whether unconditional or adaptive -- make sure that QMutex unlocking by way of
QWaitCondition waiting has the correct semantics
(P3) investigate TSX support for QReadWriteLock
(P4) optimise the implementations (at least the Linux one) by reading the
assembly
If, at the completion of the above tasks, we conclude that inlining QMutex
locking would be beneficial, with minimal side-effects, we can re-inline it. The
same applies to QReadWriteLock.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Intel Sweden AB - Registration Number: 556189-6027
Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden