[Interest] Heavily Commented Example: Simple Single Frontend with Two Backends

K. Frank kfrank29.c at gmail.com
Wed Oct 24 01:14:48 CEST 2012


Hi Thiago!

Thank you for your detailed explanation.  I am still confused; it sounds
like you are saying two different things.  (Further comments below.)

On Tue, Oct 23, 2012 at 6:11 PM, Thiago Macieira
<thiago.macieira at intel.com> wrote:
> On terça-feira, 23 de outubro de 2012 16.14.11, K. Frank wrote:
>> > volatile only forces the compiler to create an instruction to fetch the
>> > variable from memory again, to prevent caching in a register. The CPU
>> > doesn't even know about the volatile keyword anymore, it just sees a
>> > normal fetch instruction, and can therefore use the CPU cache.
>>
>> If I understand the original purpose of volatile, to confirm, reading a
>> volatile should cause a true memory fetch, not just a fetch from
>> cache.  (As you mention below, if volatile is used for memory-mapped
>> IO, that IO won't actually occur if the fetch is just a cache fetch.)
>
> No, that's not what it means.
>
> A volatile load requires the compiler to *load* again, but it doesn't instruct
> the processor how to load it. The processor may serve the load from any cache
> level or from main memory.

Okay, I'll buy that.  At the language-semantics level, volatile prohibits
the compiler from optimizing the load (or store) away.

> Unless the ABI says otherwise. The only ABI I know that says otherwise is
> IA-64's, that requires a volatile load to be done using the "load acquire"
> instruction.

But the language standard does not prohibit the load (or store) from
being satisfied by a (potentially core-specific) cache.
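
Just to pin down what I think we agree on, here is a minimal sketch
(names and code are mine, not from your mail) of the language-level
guarantee:

    // The compiler must emit a load of 'stop' on every iteration and
    // may not cache the value in a register.  Which level of the
    // memory hierarchy serves that load is the hardware's business.

    volatile bool stop = false;   // hypothetical flag written elsewhere

    void spinUntilStopped()
    {
        while (!stop) {           // a fresh load is emitted each pass
            // ... do some work ...
        }
    }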

> What you're missing is that MMIO requires the memory address to be
> uncacheable. That means the processor will bypass all cache levels and will
> just issue the right load in the memory bus. But all of that is outside the
> compiler's control. It simply loads from an address you gave it.

If the hardware has "magic" (i.e., uncacheable) addresses, then a load
from one of those addresses won't get cached.

So memory-mapped IO, for example, works because of cooperation
between the language's support of volatile (load cannot be optimized
away) and the hardware (load is from memory, not the cache, if the
address is uncacheable).

This makes good sense.
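
Concretely, I picture the MMIO case something like this (a sketch
only; the register address is made up):

    #include <cstdint>

    // Hypothetical device status register.  The OS / hardware must map
    // this address uncacheable; volatile only guarantees that the
    // compiler emits the load every time and keeps volatile accesses
    // in program order.
    constexpr std::uintptr_t kStatusRegisterAddress = 0x40000000;

    std::uint32_t readDeviceStatus()
    {
        auto *reg = reinterpret_cast<volatile std::uint32_t *>(
            kStatusRegisterAddress);
        return *reg;   // this load cannot be optimized away or hoisted
    }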

>> > Therefore, if your two threads are living on different CPUs, one CPU might
>> > not see the update on the other CPU, since the CPU caches are not
>> > updated. volatile does not help with that, you need proper memory
>> > barriers.
>> Let's say that CPU A writes to the volatile and CPU B reads from
>> it.  Isn't it the case that A's write to the volatile must cause a true
>> memory store and not just a write to cache?  (Again, memory-mapped
>> IO would not work if the store is just a write to cache.)
>>
>> Then when CPU B reads the volatile, mustn't it perform an actual
>> memory fetch, picking up the result of A's memory store?
>
> That depends on the architecture.
>
> If the write to memory had release semantics and the read had acquire
> semantics, then the two CPUs must -- somehow -- figure out and synchronise. For
> example, on IA-64, the store-release causes CPU A to mark the address as
> modified in the L3 off-die cache and the load-acquire from CPU B requires it to
> go check the L3 cache.
>
> On x86, the store from CPU A causes it to go and invalidate all cachelines
> containing that address in the other CPUs' caches. So when CPU B tries to
> read, it will be forced to go to main memory or the L3 off-die cache.
>
> On those two architectures, a "volatile" qualifier is enough to ensure proper
> behaviour. On x86, because all loads and stores are fully ordered anyway and
> on IA-64, because the ABI requires volatile loads to acquire and volatile
> stores to release.

Okay.  The scheme works on x86 because of how the hardware works,
and on IA-64 because of its ABI guarantees.
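
For my own notes, here is how I would spell out that acquire/release
pairing explicitly with C++11 atomics, instead of relying on the x86
ordering rules or the IA-64 ABI (a sketch; the names are mine):

    #include <atomic>

    std::atomic<bool> ready{false};
    int payload = 0;

    void producer()    // runs on CPU A
    {
        payload = 42;
        ready.store(true, std::memory_order_release);  // publishes 'payload'
    }

    void consumer()    // runs on CPU B
    {
        while (!ready.load(std::memory_order_acquire)) {
            // spin: the acquire load synchronizes with the release store
        }
        // 'payload' is guaranteed to be 42 here
    }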

> But if you go beyond those two Intel architectures, the bets are off. On ARMv7,
> for example, the ABI does not require a volatile load or store to insert the
> "dmb" instruction. That means in your example, CPU B would not read from the
> main memory and it could fetch the value from one of its stale caches. The
> same goes for PowerPC/POWER, MIPS, Sparc, etc.

But it need not work on other architectures; ARMv7 and the others
you mention are examples where it can fail.

I'll buy all that.

>> Let me state for the record that I do not use volatiles for thread
>> synchronization.  But the issue at hand is not whether a volatile
>> can be used for full-featured thread synchronization, but whether
>> it can be used by one thread to signal a second looping thread
>> to quit.
>
> It can. In that restricted scenario, even a non-atomic write would be
> sufficient.

But now I'm confused.

It would seem, from what you've said above, that this thread-signalling
mechanism could easily fail on, for example, a multicore ARMv7
platform.

CPU A writes to the volatile signalling variable, but it writes to its
CPU-specific cache.  The thread that's looping runs on CPU B,
and repeatedly reads from the volatile signalling variable, but always
reads it from its own CPU-specific cache.  So it never gets the
signal to quit, and potentially loops forever.
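
For the record, if volatile is not enough there, I take it the portable
version of the quit flag would use an atomic instead.  A sketch of what
I have in mind (my code, not from the thread):

    #include <atomic>

    std::atomic<bool> quitRequested{false};

    void workerLoop()        // the looping thread on CPU B
    {
        // A sequentially consistent load; the store below is guaranteed
        // to become visible here, so the loop cannot spin forever.
        while (!quitRequested.load()) {
            // ... do one unit of work ...
        }
    }

    void requestQuit()       // called by the signalling thread on CPU A
    {
        quitRequested.store(true);
    }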

>> It seems to me that volatile must cause the signalling
>> thread to perform an actual memory store and the thread to be
>> signalled to perform an actual memory fetch.  Yes, there is a
>> potential race condition in that the looping thread may read the
>> volatile after the signalling thread has written to it, but because
>> the thread is looping, it will come around and read the volatile
>> again, picking up the new value that signals it to quit.
>
> Correct.

Again, I'm confused.  Based on what you said above, why can't
the looping thread keep reading the stale (but volatile) value from
its cache, and, in fact, never pick up the new value?

>> Is there any plausible reading of the standard by which this
>> signalling mechanism could fail?  Do you know of any mainstream
>> platforms (hardware, os, and compiler) on which this signalling
>> mechanism does fail?
>
> No, I can't think of any for this particular case.

But couldn't it fail on the platforms you mentioned above, for
example, ARMv7 or Sparc?

Sorry to belabor this, but there must be something I'm missing here.

> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel Open Source Technology Center

Thanks again for your insight.


K. Frank


