[Interest] Heavily Commented Example: Simple Single Frontend with Two Backends

Wed Oct 24 02:19:13 CEST 2012

On Tuesday, 23 October 2012 19.14.48, K. Frank wrote:
> > A volatile load requires the compiler to *load* again, but it doesn't
> > instruct the processor how to load it. The processor may serve the load
> > from any cache level or from main memory.
> 
> Okay, I'll buy that.  At the language-semantics level, volatile prohibits
> the compiler from optimizing the load (or store) away.

Correct. A volatile load implies that the compiler must generate a load. It 
doesn't say what type of load that is.

The IA-64 ABI says what type of load it is. The C11 and C++11 standards also 
specify an atomic_load_explicit() that can take memory_order_acquire to 
produce an acquire load.
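
To make the acquire load concrete, here is a minimal C++11 sketch. The names 
payload, ready, producer, consumer and run_acquire_release_demo are made up 
for illustration; only std::atomic and the memory_order constants come from 
the standard. It assumes a hosted implementation with std::thread.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, chosen only for this illustration.
int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // release: publishes payload
}

int consumer() {
    // Acquire load: once it observes 'true', the earlier write to payload
    // made by the releasing thread is guaranteed to be visible here.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_acquire_release_demo() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Which instruction the acquire load turns into is up to the compiler and the 
target architecture.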

> > Unless the ABI says otherwise. The only ABI I know that says otherwise is
> > IA-64's, that requires a volatile load to be done using the "load acquire"
> > instruction.
> 
> But the language standard does not prohibit the load (or store) from
> acting on a (potentially core-specific) cache.

Correct. The processor is allowed to satisfy a generic load from anywhere that 
the architecture deems acceptable.

On x86, the "MOV" load will load from any cache but the CPU will ensure that 
it behaves in a strongly ordered way (loads are not reordered with other 
loads). On ARM and on IA-64, the plain load instructions ("ldr" and "ld", 
respectively) may load from anywhere and do *not* have to obey any ordering. 
This is about ordering, not about where it loads from.

Some architectures have special load or store instructions that specify what 
cache level they should hit. On x86, it's the MOVNTDQ instruction (NT = non-
temporal = don't cache) which can only be used for stores. On IA-64, both "ld" 
and "st" can receive the ".nta" suffix for non-temporal locality and "ld" can 
also get an ".nt1" for specifying non-temporal for level 1 only.

Those instructions are not accessible via regular programming models. You 
either need to write assembly or use an intrinsic.
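
As a rough sketch of the intrinsic route on x86: _mm_stream_si128 is the SSE2 
intrinsic that compilers normally turn into MOVNTDQ. The helper name 
fill_streaming is hypothetical, and the sketch assumes the destination is 
16-byte aligned and the element count is a multiple of 4. On non-x86 targets 
it falls back to plain stores.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#define HAVE_SSE2_STREAM 1
#endif

// Hypothetical helper: fills 'count' 32-bit integers with 'value'.
// Assumption: dst is 16-byte aligned and count is a multiple of 4.
void fill_streaming(std::int32_t* dst, std::size_t count, std::int32_t value) {
#if defined(HAVE_SSE2_STREAM)
    const __m128i v = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < count; i += 4) {
        // Non-temporal store: hints that the data should bypass the caches.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    // Non-temporal stores are weakly ordered; the fence orders them
    // before any later stores.
    _mm_sfence();
#else
    // Fallback for targets without SSE2: ordinary cached stores.
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = value;
    }
#endif
}
```

A caller would pass a buffer declared with alignas(16), for example 
"alignas(16) std::int32_t buf[8];" followed by fill_streaming(buf, 8, 7).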

> > What you're missing is that MMIO requires the memory address to be
> > uncacheable. That means the processor will bypass all cache levels and
> > will just issue the right load in the memory bus. But all of that is
> > outside the compiler's control. It simply loads from an address you
> > gave it.
> 
> If the hardware has "magic" (i.e., uncacheable) addresses, then a load
> to one of those addresses won't get cached.

Right.

> So memory-mapped IO, for example, works because of cooperation
> between the language's support of volatile (load cannot be optimized
> away) and the hardware (load is from memory, not the cache, if the
> address is uncacheable).

Right.
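
A small sketch of the compiler's half of that cooperation. The function name 
poll_until_ready, the status register and the ready bit are all hypothetical. 
On real hardware the pointer would refer to an address the platform maps as 
uncacheable; nothing in the code itself makes it so.

```cpp
#include <cstdint>

// Hypothetical device poll. Because the pointee is volatile, the compiler
// must emit a fresh load on every iteration and cannot hoist the read out
// of the loop. Whether that load bypasses the caches depends entirely on
// how the hardware maps the address.
// Returns the number of polls needed, or -1 if the bit never showed up.
int poll_until_ready(volatile std::uint32_t* status_reg,
                     std::uint32_t ready_bit,
                     int max_polls) {
    for (int polls = 1; polls <= max_polls; ++polls) {
        if (*status_reg & ready_bit) {
            return polls;
        }
    }
    return -1;
}
```

For a quick test without hardware, an ordinary volatile variable can stand in 
for the register: with the ready bit already set, the function returns 1.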

> >> Let me state for the record that I do not use volatiles for thread
> >> synchronization.  But the issue at hand is not whether a volatile
> >> can be used for full-featured thread synchronization, but whether
> >> it can be used by one thread to signal a second looping thread
> >> to quit.
> > 
> > It can. In that restricted scenario, even a non-atomic write would be
> > sufficient.
> 
> But now I'm confused.
> 
> It would seem that from what you've said above, this thread signalling
> mechanism could easily fail on, for example, a multicore ARMv7
> platform.

Not exactly. The explanation for that is below.

> CPU A writes to the volatile signalling variable, but it writes to its
> CPU-specific cache.  The thread that's looping runs on CPU B,
> and repeatedly reads from the volatile signalling variable, but always
> reads it from its own CPU-specific cache.  So it never gets the
> signal to quit, and potentially loops forever.

Well, that's not exactly how processors work. CPU A will eventually get to 
write the data from its cache back to main RAM. And CPU B will eventually get 
to notice that and discard its cache. So the code running on CPU B will 
eventually get to see the new value.

The question is only how long that might take.
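
A sketch of the quit-flag scenario under discussion. As an assumption on my 
part, it uses std::atomic<bool> with relaxed ordering rather than a plain 
volatile: relaxed operations ask for no ordering, which matches the case 
here, while keeping the program free of a data race as far as the C++11 
standard is concerned. The names quit_requested, worker_loop and 
run_stop_flag_demo are made up.

```cpp
#include <atomic>
#include <thread>

// Hypothetical quit flag shared between the two threads.
std::atomic<bool> quit_requested{false};

// The looping thread: re-reads the flag on every iteration.
// Returns how many iterations ran before the flag was seen.
long worker_loop() {
    long iterations = 0;
    while (!quit_requested.load(std::memory_order_relaxed)) {
        ++iterations;               // stand-in for real work
        std::this_thread::yield();
    }
    return iterations;
}

// The signalling side: start the worker, request the quit, wait for it.
long run_stop_flag_demo() {
    quit_requested.store(false, std::memory_order_relaxed);
    long iterations = -1;
    std::thread worker([&iterations] { iterations = worker_loop(); });
    quit_requested.store(true, std::memory_order_relaxed);
    worker.join();
    return iterations;
}
```

How many iterations the worker completes before it sees the flag is not 
predictable, which is exactly the "how long" question.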

Also note that you should not implement a busy-wait loop like that, in the 
manner of a spinlock. At least on x86, if one CPU is constantly reading from 
the same address, it will prevent the others from writing to it. And since 
you're busy-looping, you're spending power.
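
One possible alternative, if the second thread has nothing to do except wait 
for the signal, is to block on a condition variable instead of spinning. This 
is only a sketch; StopSignal and run_blocking_wait_demo are hypothetical 
names, and a thread that does real work in its loop would keep checking a 
flag between work items instead.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: lets one thread sleep until another requests a stop.
struct StopSignal {
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;

    void request_stop() {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();  // wake any thread blocked in wait_for_stop()
    }

    void wait_for_stop() {
        std::unique_lock<std::mutex> lock(m);
        // The predicate guards against spurious wakeups and against the
        // case where request_stop() ran before we started waiting.
        cv.wait(lock, [this] { return stop; });
    }
};

// Returns true once the waiting thread has been woken by the signal.
bool run_blocking_wait_demo() {
    StopSignal signal;
    bool woke = false;
    std::thread waiter([&] {
        signal.wait_for_stop();
        woke = true;
    });
    signal.request_stop();
    waiter.join();
    return woke;
}
```

While blocked, the waiting thread is descheduled by the OS and does not keep 
re-reading the shared address.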

> >> It seems to me that volatile must cause the signalling
> >> thread to perform an actual memory store and the thread to be
> >> signalled to perform an actual memory fetch.  Yes, there is a
> >> potential race condition in that the looping thread may read the
> >> volatile after the signalling thread has written to it, but because
> >> the thread is looping, it will come around and read the volatile
> >> again, picking up the new value that signals it to quit.
> > 
> > Correct.
> 
> Again, I'm confused.  Based on what you said above, why can't
> the looping thread keep reading the stale (but volatile) value from
> its cache, and, in fact, never pick up the new value?

Because eventually the CPUs will figure it out.

> >> Is there any plausible reading of the standard by which this
> >> signalling mechanism could fail?  Do you know of any mainstream
> >> platforms (hardware, os, and compiler) on which this signalling
> >> mechanism does fail?
> > 
> > No, I can't think of any for this particular case.
> 
> But couldn't it fail on the platforms you mentioned above, for
> example, ARMv7 or Sparc?

No.

I *was* about to write that the CPUs would never find out in one particular 
circumstance (the cacheline was never invalidated by CPU B), but then I came 
to the realisation that this case simply can't exist by design of the 
architecture. Even on IA-64, you can implement threading with relaxed 
(non-acquire/release) operations only.

Unless we have the "and then there's Alpha" case... I can't speak for Alpha.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center