[Interest] Heavily Commented Example: Simple Single Frontend with Two Backends

Wed Oct 24 19:58:30 CEST 2012

On Wednesday, 24 October 2012 12.09.05, K. Frank wrote:
> On Wed, Oct 24, 2012 at 10:56 AM, Thiago Macieira
> <thiago.macieira at intel.com> wrote:
> > On Tuesday, 23 October 2012 17.19.13, Thiago Macieira wrote:
> >> Well, that's not exactly how processors work. CPU A will eventually get
> >> to write the data from its cache back to main RAM. And CPU B will
> >> eventually get to notice that and discard its cache. So the code running
> >> on CPU B will eventually get to see the new value.
> >> 
> >> The question is only how long that might take.
> > 
> > Actually, I take this back too.
> 
> I'm confused.  Which part are you taking back?  That CPU B will
> see the new value?  Or that there is a question about how long
> it will take?  (Or both?)

That it might take an unreasonably long time.

CPU A will write to its cache, its cache will flush; CPU B's caches will be 
invalidated after an implementation-defined time and thus the code in CPU B 
will notice the change in value. What's more, the time it takes is usually 
small, provided that there's no misbehaving code.

I realised this because both ARM and IA-64 -- architectures with weak memory 
ordering -- have no special instruction for this kind of activity. If all you 
need is a flag signalling a condition, you'd use the standard "ld1" / "ld4" 
instruction on IA-64 or the "ldrb" / "ldr" instruction on ARM.

Since there's no special instruction to be used, it stands to reason that the 
architecture must somehow make the value available to other CPUs in reasonable 
time.
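
To make the flag case concrete, here is a minimal sketch in C++11. It assumes
std::atomic rather than any Qt class, and the names stopFlag,
pollUntilStopped and runStopFlagDemo are made up for illustration. A relaxed
atomic load asks for no ordering and no fence; on common compilers it
typically becomes the same ordinary load instruction mentioned above. Its
only job here is to oblige the compiler to really read the flag on every
iteration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical flag shared between the two threads.
std::atomic<bool> stopFlag{false};

// Spins until another thread sets stopFlag; returns the number of polls.
unsigned long long pollUntilStopped()
{
    unsigned long long polls = 0;
    // Relaxed: no ordering guarantees requested, just a real load each time.
    while (!stopFlag.load(std::memory_order_relaxed))
        ++polls;
    return polls;
}

// Runs the poller on a worker thread and sets the flag from the caller.
bool runStopFlagDemo()
{
    stopFlag.store(false, std::memory_order_relaxed);
    unsigned long long polls = 0;
    std::thread worker([&polls] { polls = pollUntilStopped(); });

    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    stopFlag.store(true, std::memory_order_relaxed);

    worker.join();  // returns only once the worker has seen the new value
    return stopFlag.load(std::memory_order_relaxed);
}
```

Calling runStopFlagDemo() from main() should return promptly. Nothing in the
code flushes or refreshes a cache explicitly; the hardware takes care of
making the store visible to the polling thread.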

> > There's no instruction to make the CPU flush the caches sooner, at least
> > not one that programs usually use. Same thing on the other end: no
> > instruction to make a load faster.
> 
> It sounds like you're saying that in "normal code," nothing will explicitly
> flush the cache to main memory or refresh it from main memory.

Yes.

> > So the code that the compiler generates is probably fine.
> 
> But for the code to be fine, CPU A's cache must, at some point, get
> flushed to main memory, and CPU B's cache must get refreshed from
> main memory (or a mutually shared secondary cache).  Sorry for not
> understanding what you're saying.

Yes, and they will, without you or the compiler doing anything special.

> > All you need to do is ensure that it *does* generate the load.
> 
> Again, to belabor the question, suppose the compiler does generate
> the load of the stop flag (i.e., the read of the stop flag from its memory
> address -- which might be cached -- is not optimized away).  What,
> in principle, prevents the thread whose loop polls the stop flag from
> just sitting on CPU B (never, because of odd luck, being context
> switched off of CPU B), and repeatedly reading the stale stop flag from
> CPU B's cache?

The architecture. It's designed in such a way that the code works.

> Now, in practice, it's highly unlikely that CPU B won't ever refresh its
> cache or that the stop-flag-polling thread won't ever be context switched
> onto a core with an up-to-date cache, but isn't it in principle possible
> for the flag-polling thread to keep reading the stale value of the stop
> flag, and run forever, even though some other thread set the stop flag
> to true?

That's what I initially thought when I wrote the email, but I eventually 
decided to remove it before sending. Then I realised that even what I had 
written was still wrong.

I was thinking of this scenario:
 - this task is the only task scheduled to CPU B: the OS will neither
   context-switch it to another processor nor context-switch another task in
 - this task is not executing any instructions that would force a cache flush
 - the OS is not executing any task on any processor that would force one

Under those conditions, my reasoning was that CPU B's local caches could 
contain the old value for an indefinite period of time, maybe forever.

However, that notion is absurd. Caches aren't designed to do that. They are 
designed to keep a certain amount of data closer to the CPU for faster access, 
but never to provide wrong data -- for some definition of "wrong". All multi-
processor systems need some kind of mechanism to keep caches coherent.

The solution on Intel x86 is that a CPU writing to a particular cacheline must 
exclusively acquire it. A core or package that writes to a cacheline will 
actively go after copies of the same cacheline in other caches and invalidate 
them. This solution is simple and effective if you have only a handful of 
caches, but it doesn't scale well -- which is why Intel still makes a lot of 
money from Itanium, an architecture whose death has been predicted over and 
over again. It also creates problems like "false sharing" (cf. the IA-32 
"Optimisation Reference Manual" user/source coding rule 23).
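
As an illustration of false sharing and the usual workaround, here is a
sketch in C++11. It assumes a 64-byte cacheline, which is typical for x86 but
not guaranteed, and all the names in it are made up. Two threads each bump
their own counter. Without the alignment the two counters would likely sit in
the same cacheline, so each write would invalidate the other core's copy even
though the threads never touch the same variable.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cachelines; adjust for the target hardware.
constexpr std::size_t kAssumedCachelineSize = 64;

// alignas pads each counter out to a full (assumed) cacheline.
struct alignas(kAssumedCachelineSize) PaddedCounter {
    std::atomic<unsigned long> value{0};
};

struct Counters {
    PaddedCounter a;  // written only by the first thread
    PaddedCounter b;  // written only by the second thread
};

// Increments both counters from separate threads and returns their sum.
unsigned long runPaddedCounters(unsigned long iterations)
{
    Counters c;
    auto bump = [iterations](PaddedCounter &p) {
        for (unsigned long i = 0; i < iterations; ++i)
            p.value.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1([&] { bump(c.a); });
    std::thread t2([&] { bump(c.b); });
    t1.join();
    t2.join();
    return c.a.value.load() + c.b.value.load();
}
```

The result is the same with or without the padding; only the amount of
cacheline traffic between the two cores changes.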

The solution on Itanium, which is designed for scaling a lot more, is that the 
common, off-die L3 cache contains a listing of which caches have been 
invalidated by each processor. So when CPU A writes to that cacheline, it will 
notify the L3. When CPU B reads from that cacheline, it will check with the L3 
and find out that its cached data has been invalidated.
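
The directory idea can be sketched as a toy model in C++. This is not the
actual Itanium protocol, whose details I don't have at hand; it is a
deliberately simplified illustration with one cacheline, two CPUs, and
made-up names. A write marks the other CPU's copy as stale in the shared
directory, and a read consults the directory before trusting the local copy.

```cpp
// Toy model only: one cacheline, two CPUs, one shared directory.
struct ToyLine {
    int memory = 0;                  // value held at the shared level
    int cached[2] = {0, 0};          // each CPU's private copy
    bool valid[2] = {false, false};  // directory: is that copy current?

    // CPU `cpu` (0 or 1) writes a new value to the line.
    void write(int cpu, int value)
    {
        cached[cpu] = value;
        memory = value;          // simplification: written through at once
        valid[cpu] = true;
        valid[1 - cpu] = false;  // the other CPU's copy is now stale
    }

    // CPU `cpu` (0 or 1) reads the line.
    int read(int cpu)
    {
        if (!valid[cpu]) {       // check the directory first
            cached[cpu] = memory;
            valid[cpu] = true;
        }
        return cached[cpu];
    }
};
```

With this model, if CPU 1 has read the line and CPU 0 then writes 42 to it,
the next read on CPU 1 finds its copy marked stale and fetches 42 instead of
returning the old value.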

What I don't remember by heart is *how* and *when* those things happen. I know 
they happen.

If you want to read some more on this subject, take a look at this article:
	http://www.realworldtech.com/poulson/7/

It's talking about the Poulson, the codename for the Itanium processor that 
has apparently not been released yet. The current Itanium processor is the 
Tukwila. The x86 family processor that the article talks about is the 
Westmere-EX, the family of Xeon processors comparable to the first generation 
Core-i7. There have been two more generations of Xeon since that article.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center