[Interest] Heavily Commented Example: Simple Single Frontend with Two Backends

Wed Oct 24 20:37:14 CEST 2012

Hello Thiago!

Thank you for following up.

On Wed, Oct 24, 2012 at 1:58 PM, Thiago Macieira
<thiago.macieira at intel.com> wrote:
> On quarta-feira, 24 de outubro de 2012 12.09.05, K. Frank wrote:
>> On Wed, Oct 24, 2012 at 10:56 AM, Thiago Macieira
>> <thiago.macieira at intel.com> wrote:
>> > On terça-feira, 23 de outubro de 2012 17.19.13, Thiago Macieira wrote:
>> >> Well, that's not exactly how processors work. CPU A will eventually get
>> >> to write the data from its cache back to main RAM. And CPU B will
>> >> eventually get to notice that and discard its cache. So the code running
>> >> on CPU B will eventually get to see the new value.
>> >>
>> >> The question is only how long that might take.
>> >
>> > Actually, I take this back too.
>>
>> I'm confused.  Which part are you taking back?  That CPU B will
>> see the new value?  Or that there is a question about how long
>> it will take?  (Or both?)
>
> That it might take an unreasonably long time.
>
> CPU A will write to its cache, its cache will flush; CPU B's caches will be
> invalidated after an implementation-defined time and thus the code in CPU B
> will notice the change in value. What's more, the time it takes is usually
> small, provided that there's no misbehaving code.
>
> I realised this because both ARM and IA-64 -- architectures with weak memory
> ordering -- have no special instruction for this kind of activity. If all you
> need is a flag signalling a condition, you'd use the standard "ld1" / "ld4"
> instruction on IA-64 or the "ldrb" / "ldr" instruction on ARM.
>
> Since there's no special instruction to be used, it stands to reason that the
> architecture must somehow make the value available to other CPUs in reasonable
> time.

Okay.  I'll buy that.  It is certainly how I would want my hardware to behave.

>> > There's no instruction to make the CPU flush the caches sooner, at least
>> > not one that programs usually use. Same thing on the other end: no
>> > instruction to make a load faster.
>>
>> It sounds like you're saying that in "normal code," nothing will explicitly
>> flush the cache to main memory or refresh it from main memory.
>
> Yes.
>
>> > So the code that the compiler generates is probably fine.
>>
>> But for the code to be fine, CPU A's cache must, at some point, get
>> flushed to main memory, and CPU B's cache must get refreshed from
>> main memory (or a mutually shared secondary cache).  Sorry for not
>> understanding what you're saying.
>
> Yes, and they will, without you or the compiler doing anything special.
>
>> > All you need to do is ensure that it *does* generate the load.
>>
>> Again, to belabor the question, suppose the compiler does generate
>> the load of the stop flag (i.e., the read of the stop flag from its memory
>> address -- which might be cached -- is not optimized away).  What,
>> in principle, prevents the thread whose loop polls the stop flag from
>> just sitting on CPU B (never, because of odd luck, being context
>> switched off of CPU B), and repeatedly reading the stale stop flag from
>> CPU B's cache?
>
> The architecture. It's designed in such a way that the code works.
>
>> Now, in practice, it's highly unlikely that CPU B won't ever refresh its
>> cache or that the stop-flag-polling thread won't ever be context switched
>> onto a core with an up-to-date cache, but isn't it in principle possible
>> for the flag-polling thread to keep reading the stale value of the stop
>> flag, and run forever, even though some other thread set the stop flag
>> to true?
>
> That's what I initially thought when I wrote the email, but I eventually
> decided to remove it before answering. Then I realised that even what I had
> written was still wrong.
>
> I was thinking of this scenario:
>  - this task is the only task scheduled to CPU B:
>    the OS will not context-switch it to another processor nor context-switch
>    another task in
>  - this task is not executing any instructions that would force a cache flush
>  - the OS is not executing any tasks on any processor that would force such
>    a flush
>
> Under those conditions, my reasoning was that CPU B's local caches could
> contain the old value for an indefinite period of time, maybe forever.

Yes, that was the scenario I was also imagining.

> However, that notion is absurd. Caches aren't designed to do that. They are
> designed to keep a certain amount of data closer to the CPU for faster access,
> but never to provide wrong data -- for some definition of "wrong". All
> multi-processor systems need some kind of mechanism to keep caches coherent.
>
> The solution on Intel x86 is that a CPU writing to a particular cacheline must
> exclusively acquire it. One core or package that writes to a cacheline will
> actively go after the same cacheline in other caches and invalidate them. This
> solution is simple and effective if you have only a handful of caches, but it
> doesn't scale well -- which is why Intel still makes a lot of money from
> Itanium, an architecture whose death has been predicted over and over again.
> It also creates problems like "false sharing" (cf. the IA-32 "Optimisation
> Reference Manual" user/source coding rule 23).
>
> The solution on Itanium, which is designed for scaling a lot more, is that the
> common, off-die L3 cache contains a listing of which caches have been
> invalidated by each processor. So when CPU A writes to that cacheline, it will
> notify the L3. When CPU B reads from that cacheline, it will check with the L3
> and find out that its cached data has been invalidated.
>
> What I don't remember by heart is *how* and *when* those things happen. I know
> they happen.
>
> If you want to read some more on this subject, take a look at this article:
>         http://www.realworldtech.com/poulson/7/
>
> It's talking about the Poulson, the codename for the Itanium processor that
> has apparently not been released yet. The current Itanium processor is the
> Tukwila. The x86 family processor that the article talks about is the
> Westmere-EX, the family of Xeon processors comparable to the first generation
> Core-i7. There have been two more generations of Xeon since that article.
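
As an aside on the "false sharing" mentioned above: because a writer has to
acquire the whole cacheline exclusively, two threads that write to
*different* variables living on the *same* line keep invalidating each
other's copy. Here is a minimal sketch of the usual workaround, padding each
variable out to its own line. The 64-byte line size and all of the names are
my assumptions, not something taken from the manual or from this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cachelines, which is typical for current x86 parts.
constexpr std::size_t kAssumedCacheLine = 64;

// Both counters will most likely land on the same cacheline.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter is aligned to, and therefore occupies, its own cacheline.
struct PaddedCounters {
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> b{0};
};

// Runs two threads; each increments only "its" counter n times.
// Returns the sum of both counters after the threads have been joined.
template <typename Counters>
std::uint64_t hammer(Counters &c, std::uint64_t n)
{
    std::thread t1([&] {
        for (std::uint64_t i = 0; i < n; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::uint64_t i = 0; i < n; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}

// Both layouts must produce the same totals; only the speed may differ.
void check_both_layouts(std::uint64_t n)
{
    PackedCounters packed;
    PaddedCounters padded;
    assert(hammer(packed, n) == 2 * n);
    assert(hammer(padded, n) == 2 * n);
}
```

The two layouts compute the same result. Any difference would show up only
in timing, and how large it is depends on the machine, so I am not quoting
numbers here.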

So, to summarize what you're saying, real world hardware is not
excessively perverse.

Therefore the volatile-bool scheme for signalling a thread to stop will
work in practice (but volatile is not good enough for full-featured thread
synchronization), because the hardware will cause the updated value
of the stop flag to become visible to the polling thread in a reasonable
amount of time.  The polling thread might run a little longer than expected,
but it will stop in a reasonable amount of time (for some reasonable
definition of "reasonable").
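
For concreteness, here is a minimal sketch of the stop-flag scheme we have
been discussing. Note that I have swapped the volatile bool for a C++11
std::atomic<bool> with relaxed ordering: formally, unsynchronized access to
a volatile bool from two threads is a data race in C++11, while a relaxed
atomic load should still compile down to an ordinary load on mainstream
hardware (the "ldr" / "ld1" case you describe) and may not be hoisted out of
the loop by the compiler. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stop flag shared between the controlling and polling threads.
std::atomic<bool> stop_requested{false};

// Polls the flag on every iteration and returns how many times it looped.
// The load is re-issued each time around; nothing flushes or refreshes a
// cache explicitly. Cache coherence is left entirely to the hardware.
std::uint64_t poll_until_stopped()
{
    std::uint64_t iterations = 0;
    while (!stop_requested.load(std::memory_order_relaxed))
        ++iterations;
    return iterations;
}

// Starts a polling thread, lets it spin for `delay`, then raises the flag.
// Returns the number of iterations the polling thread completed.
std::uint64_t run_worker_and_stop(std::chrono::milliseconds delay)
{
    stop_requested.store(false, std::memory_order_relaxed);

    std::uint64_t iterations = 0;
    std::thread worker([&iterations] { iterations = poll_until_stopped(); });

    std::this_thread::sleep_for(delay);
    stop_requested.store(true, std::memory_order_relaxed);

    // join() returns only once the polling thread has seen the new value.
    worker.join();
    assert(stop_requested.load(std::memory_order_relaxed));
    return iterations;
}
```

If the hardware behaves as described, a call such as
run_worker_and_stop(std::chrono::milliseconds(20)) returns shortly after the
delay expires. The flag by itself orders nothing else, which is the sense in
which it is not good enough for full-featured synchronization.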

> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel Open Source Technology Center

Again, thank you for your explanations.

K. Frank