[Development] Mutex future directions

Sat May 19 09:30:56 CEST 2012

A few comments from my side:

* I am not a big fan of inlining the public classes neither. It feels a
bit like over-optimizing at the wrong place. Reason is that this might
make it a impossible later on to refactor our code.
* It's ok to have some low level inline class for internal use if that
helps us in performance critical parts (such as signal/slot code). Nothing
says this and the public QMutex class have to be the same.
* IMO we should really try to allow using tools such as helgrind also with
release build
* it would be good to see how much of a real world (ie. not with
artificial benchmarks) difference you get due to inlining the mutex code.
Is it really relevant?

Cheers,
Lars

On 5/19/12 12:36 AM, "ext Olivier Goffart" <olivier at woboq.com> wrote:

>On Friday 18 May 2012 23:25:47 Thiago Macieira wrote:
>> > > QBasicMutex is an internal POD class that offers non-recursive
>>locking.
>> > > It's incredibly efficient for Linux, where the single pointer-sized
>> > > member variable is enough to execute the futex operations, in all
>>cases.
>> > > For other platforms, we get the same efficiency in non-contended
>>cases,
>> > > but incur a non-negligible performance penalty when contention
>>happens.
>> > 
>> > Which non-negligible performance penalty are you takling about here?
>> 
>> The need to allocate memory and do the test-and-set loop for setting the
>> private.
>
>There is no memory allocation, it is using the QFreeList.
>And this happens when we are about to do a context switch. So I beleive
>the 
>test-and-set loop should be neglectible.
>
>> Right, and users use the classes as intended... :-P
>
>(QBasicMutex is an undocumented internal class)
>
>> Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU
>> resources. It will consume front-end decoding, one ALU port, the
>>retirement,
>> not to mention increased code size. There's no free lunch.
>
>Slower than a function call (which will likely spill register)?  I don't
>beleive so.
>
>(moreever, we are talking about debug build, right?)
>
>> > That is not relevant.
>> > Transactional memory requires a different set of primitives. QMutex
>>has
>> > nothing to do with transactional memory.
>> 
>> Then you didn't read the manual. I'm going to ignore what you said
>>until you
>> read the manual because transactional memory has everythig to do with
>> QMutex.
>> 
>> Please, either believe me or read the manual.
>
>Do you have a link to that manuel?
>
>Transactional memory is about detecting conflicts between transactions,
>and 
>rolling back.
>Mutexes are about locking until the resource is free
>
>Transactional memory could be used to simplify the code that allocate the
>QMutexPrivate.
>
>But I doubt the hypothetical gain is even worth the function call.
>
>Because remember the uncontended case is much more critical than the
>contended 
>case. (There is no point in winning a dozens of cycles if we are going to
>pay 
>anyway several hendreds in the context switch that follow)
>
>But fast uncontended case is critical because it allows the use of mutex
>in 
>part that might or not be used with threads. Example: QObject. We need to
>put 
>mutexes while emiting a signal because maybe it is used in multiple
>thread. 
>But we don't want to pay too much when used in a single thread (the
>common 
>case)
>
>> > > (P2) optimise the Mac and Windows implementations to avoid the need
>>for
>> > > allocating a dynamic d pointer in case of contention. In fact,
>>remove
>> > > the
>> > > need for dynamic d-pointer allocation altogether: Mac, Windows and
>>Linux
>> > > should never do it, while the generic Unix implementation should do
>>it
>> > > all
>> > > the time in the constructor
>> > 
>> > The allocation of the d-ponter is not that dynamic.
>> > It is already quite optimized in Qt5
>> 
>> My suggestion was to avoid the QMutexPrivate allocation on Mac and
>>Windows,
>> since they require just a bit of initialisation. Now, given what you
>>said
>> about semaphore_t, we may not be able to do it on the Mac. But we can
>>try to
>> apply the same optimisation for Windows -- the initialisation is a call
>>to
>> CreateEvent.
>
>But CreateEvent still probably allocate memory behind.
>
>> > I beleive there is more important priority right now than touching
>>QMutex,
>> > which already had its lifting for Qt5.
>> 
>> I disagree. If we shoot ourselves in the foot by not being able to
>>support
>> TSX and valgrind in Qt 5, we've lost a lot. That's why I propose
>> de-inlining for 5.0, so we have enough time to investigate those
>>potential
>> drawbacks.
>
>I think the inline is not a problem for valgrind.
>
>And I don't think we can gain much with TSX.
>
>And even if we could, we still do pretty much everything in a binary
>compatible way despite the inlines (We can say 2 means unlocked and 3
>locked, 
>then the old code fallbacks to the non-inline case (The first lock would
>handle the 0 or 1))
>
>Is it not however shooting ourself in the feet not to inline it? Because
>we 
>hardly can inline it later in a binary compatible way.
>
>-- 
>Olivier
>
>Woboq - Qt services and support - http://woboq.com
>_______________________________________________
>Development mailing list
>Development at qt-project.org
>http://lists.qt-project.org/mailman/listinfo/development