[Development] Mutex future directions

Fri May 18 23:25:47 CEST 2012

On sexta-feira, 18 de maio de 2012 22.45.22, Olivier Goffart wrote:
> On Friday 18 May 2012 20:34:51 Thiago Macieira wrote:
> > Hello
> > 
> > I've just completed a review of the QMutex & family internals, as well as
> > the Intel optimisation manuals about threading, data sharing, locking and
> > the Transactional Memory extensions.
> > 
> > Short story: I recommend de-inlining QMutex code for Qt 5.0. We can
> > re-inline later. I also recommend increasing the QBasicMutex / QMutex size
> > to accommodate more data without the pointer indirection. Further, some
> > work needs to be done to support Valgrind. Details below.
> 
> The inlining showed to improve a lot the performences for
> performence-critical operation where two function calls matters. It
> improved performence of signal emition for example.
> Now, if you can prove me that the code of the function call is neglectible
> compared to the atomic operation, maybe i can change my mind. But last time
> i checked, it was usefull.

No, I can't prove that. Of course that doing operations X and Y is more 
expensive than doing just Y.

The function calls do have an impact in performance. But it's not performance 
I want to trade off.

> As for the size, I think it is really nice to have a light QMutex. That way,
> you can have more QMutex, and then more granularity.
> None of your arguments convinces me it is a good idea to increase the size.
> Assigning a QMutexPrivate to a contended QMutex should be fast enough.

But it's not necessary. In fact, the size of QMutex would be platform 
dependent. On Linux, it would remain the size of a pointer, on Windows it 
would be two sizes of a pointer, on the Mac it would be the size of 
semaphore_t plus an integer (plus padding).

For a recursive mutex, I don't mind keeping the private. But I want to 
investigate whether we can have a recursive QMutex without indirection with 
just a few more bytes.

> > Long story follows.
> > 
> > Current state;
> > 
> > QBasicMutex is an internal POD class that offers non-recursive locking.
> > It's incredibly efficient for Linux, where the single pointer-sized
> > member variable is enough to execute the futex operations, in all cases.
> > For other platforms, we get the same efficiency in non-contended cases,
> > but incur a non-negligible performance penalty when contention happens.
> 
> Which non-negligible performance penalty are you takling about here?

The need to allocate memory and do the test-and-set loop for setting the 
private.

> > QBasicMutex's documentation also has a note saying that timed locks may
> > not
> > work and may cause memory leaks.
> 
> It works.  But it might leak the QMutexPrivate if the QMutex destructor is
> not run.  This is ususally not a problem because QBasicMutex is meant to be
> a global object. (also, who uses tryLock with a timeout?)

Right, and users use the classes as intended... :-P

> > Valgrind's helgrind and DRD tools can currently operate on Qt locks by
> > preloading a library and hijacking QMutex functions. I have tested
> > helgrinding Qt4 applications in the past. I do not see a library version
> > number in the symbols, so it's possible that helgrind would work
> > unmodified
> > with Qt 5 if we were to de-inline the functions. However, we should
> > probably approach the Valgrind community to make sure that the Qt 5
> > mutexes
> > work.
> 
> Yes, there should be some NOOP we could introduce in debug to hint those
> tools.
> But that can work on the inline code.

Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU 
resources. It will consume front-end decoding, one ALU port, the retirement, 
not to mention increased code size. There's no free lunch.

> > However, there are no processors in the market with transactional memory
> > support and I don't have access to a prototype (yet, anyway), so at this
> > point we simply have no idea whether enabling transactions for all mutex
> > locks is a good idea. If enabling for them all isn't a good idea, the code
> > for being adaptative cannot be inlined and we're further bound by the
> > current inline code in QMutex.
> 
> That is not relevant.
> Transactional memory requires a different set of primitives. QMutex has
> nothing to do with transactional memory.

Then you didn't read the manual. I'm going to ignore what you said until you 
read the manual because transactional memory has everythig to do with QMutex.

Please, either believe me or read the manual.

> > Recommendations (priority):
> > 
> > (P0) de-inline QBasicMutex locking functions until we solve some or all of
> > the below problems
> 
> What problems exactly?

The ones listed below:

> > (P1) expand the size of QBasicMutex and QMutex to accommodate more data,
> > for Mac and Windows. On Windows, the extra data required is one pointer
> > (a HANDLE), whereas on Mac it's a semaphore_t. Depending on the next
> > task, we might need a bit more space. At first glance, all three
> > implementations would work fine with a lock state and the wait structure
> > -- with the futex implementation doing both in one single integer.
> 
> Those HANDLE or semaphore_t have much higher costs. They need to be created
> and destroyed. And they allocate some private data as well.
> 
> semaphore_t is a good example. We tried to use the semaphore on mac Qt 4.8.
> But it has to be reverted back to pthread. The reason was a huge preformence
> regression in constructing QMutex (creating the semaphore was too slow)
> 
> In Qt 5, only few semaphore_t are created, and shared between mutexes.

Interesting, I didn't know that. Thanks for the information.

> > (P1) approach the Valgrind developers to ensure that helgrind and DRD work
> > with Qt 5
> > 
> > (P2) optimise the Mac and Windows implementations to avoid the need for
> > allocating a dynamic d pointer in case of contention. In fact, remove the
> > need for dynamic d-pointer allocation altogether: Mac, Windows and Linux
> > should never do it, while the generic Unix implementation should do it all
> > the time in the constructor
> 
> The allocation of the d-ponter is not that dynamic.
> It is already quite optimized in Qt5

My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows, 
since they require just a bit of initialisation. Now, given what you said 
about semaphore_t, we may not be able to do it on the Mac. But we can try to 
apply the same optimisation for Windows -- the initialisation is a call to 
CreateEvent.

> > (P2) investigate whether the recursive QMutex can benefit from the
> > expanded
> > size of QMutex too
> 
> We should not care that much about recursive mutex.

Fair enough.

> > (P2) analyse QReadWriteLock and see if expanding the size of the structure
> > like QMutex would be beneficial
> 
> I had tought about implementing QReadWriteLock on top of QBasicMutex.  I
> beleive it is possible to add a QBasicMutex::lockForRead,  and increment the
> d_ptr for every reader.  Then if it is bigger than, say, 1024, consider it
> is a pointer.

I don't have a suggestion on how to implement it yet, though. I wanted to 
investigate a little to decide whether it was worth it expanding the structure 
before 5.0.

> > (P3) investigate TSX support for QMutex, whether by using HLE or RTM, if
> > unconditional or adaptative -- make sure that QMutex unlocking by way of
> > QWaitCondition waiting has the correct semantics
> > 
> > (P3) investigate TSX support for QReadWriteLock
> > 
> > (P4) optimise the implementations (at least the Linux one) by reading the
> > assembly
> > 
> > If, at the completion of the above tasks, we conclude that inlining QMutex
> > locking would be beneficial, with minimal side-effects, we can re-inline
> > it. The same applies to QReadWriteLock.
> 
> I beleive there is more important priority right now than touching QMutex,
> which already had its lifting for Qt5.

I disagree. If we shoot ourselves in the foot by not being able to support TSX 
and valgrind in Qt 5, we've lost a lot. That's why I propose de-inlining for 
5.0, so we have enough time to investigate those potential drawbacks.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.qt-project.org/pipermail/development/attachments/20120518/59620e89/attachment.sig>