[Interest] Heavily Commented Example: Simple Single Frontend with Two Backends
K. Frank  kfrank29.c at gmail.com
Wed Oct 24 20:37:14 CEST 2012

Hi,

Thank you for following up.
On Wed, Oct 24, 2012 at 1:58 PM, Thiago Macieira
<thiago.macieira at intel.com> wrote:
> On Wednesday, 24 October 2012 12:09:05, K. Frank wrote:
>> On Wed, Oct 24, 2012 at 10:56 AM, Thiago Macieira
>> <thiago.macieira at intel.com> wrote:
>> > On Tuesday, 23 October 2012 17:19:13, Thiago Macieira wrote:
>> >> Well, that's not exactly how processors work. CPU A will eventually
>> >> get to write the data from its cache back to main RAM. And CPU B will
>> >> eventually get to notice that and discard its cache. So the code
>> >> running on CPU B will eventually get to see the new value.
>> >> The question is only how long that might take.
>> > Actually, I take this back too.
>> I'm confused. Which part are you taking back? That CPU B will
>> see the new value? Or that there is a question about how long
>> it will take? (Or both?)
> That it might take an unreasonably long time.
> CPU A will write to its cache, its cache will flush; CPU B's caches will be
> invalidated after an implementation-defined time and thus the code in CPU B
> will notice the change in value. What's more, the time it takes is usually
> small, provided that there's no misbehaving code.
> I realised this because both ARM and IA-64 -- architectures with weak memory
> ordering -- have no special instruction for this kind of activity. If all you
> need is a flag signalling a condition, you'd use the standard "ld1" / "ld4"
> instruction on IA-64 or the "ldrb" / "ldr" instruction on ARM.
> Since there's no special instruction to be used, it stands to reason that the
> architecture must somehow make the value available to other CPUs in a
> reasonable amount of time.
Okay. I'll buy that. It is certainly how I would want my hardware to behave.
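Just so we are talking about the same thing, here is a minimal sketch of
the volatile-bool stop-flag scheme under discussion (my own example, not
code from this thread):

    #include <thread>

    volatile bool stopRequested = false;  // plain flag; no special instructions

    void worker()
    {
        // volatile forces a real load -- the ordinary "ldrb" / "ld1" you
        // mention -- on every iteration; it cannot be cached in a register.
        while (!stopRequested) {
            // ... do a unit of work ...
        }
    }

    int main()
    {
        std::thread t(worker);
        stopRequested = true;  // the store; the hardware propagates it
        t.join();              // worker eventually sees true and exits
    }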
>> > There's no instruction to make the CPU flush the caches sooner, at least
>> > not one that programs usually use. Same thing on the other end: no
>> > instruction to make a load faster.
>> It sounds like you're saying that in "normal code," nothing will explicitly
>> flush the cache to main memory or refresh it from main memory.
>> > So the code that the compiler generates is probably fine.
>> But for the code to be fine, CPU A's cache must, at some point, get
>> flushed to main memory, and CPU B's cache must get refreshed from
>> main memory (or a mutually shared secondary cache). Sorry for not
>> understanding what you're saying.
> Yes, and they will, without you or the compiler doing anything special.
>> > All you need to do is ensure that it *does* generate the load.
>> Again, to belabor the question, suppose the compiler does generate
>> the load of the stop flag (i.e., the read of the stop flag from its memory
>> address -- which might be cached -- is not optimized away). What,
>> in principle, prevents the thread whose loop polls the stop flag from
>> just sitting on CPU B (never, because of odd luck, being context
>> switched off of CPU B), and repeatedly reading the stale stop flag from
>> CPU B's cache?
> The architecture. It's designed in such a way that the code works.
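To check my understanding of "generate the load", here is a sketch of my
own (not from this thread) of the two cases:

    bool plainStop = false;              // hypothetical non-volatile flag
    volatile bool volatileStop = false;  // hypothetical volatile flag

    void pollPlain()
    {
        // Legal optimization: read plainStop once, then spin forever --
        // effectively "if (!plainStop) for (;;) {}".
        while (!plainStop) { }
    }

    void pollVolatile()
    {
        // volatile forbids that: a fresh load is emitted every iteration.
        while (!volatileStop) { }
    }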
>> Now, in practice, it's highly unlikely that CPU B won't ever refresh its
>> cache or that the stop-flag-polling thread won't ever be context switched
>> onto a core with an up-to-date cache, but isn't it in principle possible
>> for the flag-polling thread to keep reading the stale value of the stop
>> flag, and run forever, even though some other thread set the stop flag
>> to true?
> That's what I initially thought when I wrote the email, but I eventually
> decided to remove it before answering. Then I realised that even what I
> had written was still wrong.
> I was thinking of this scenario:
> - this task is the only task scheduled to CPU B, so the OS will not
>   context-switch it to another processor nor context-switch another task in
> - this task is not executing any instructions that would force a cache flush
> - the OS is not executing any tasks on any processor that would force that
> Under those conditions, my reasoning was that CPU B's local caches could
> contain the old value for an indefinite period of time, maybe forever.
Yes, that was the scenario I was also imagining.
> However, that notion is absurd. Caches aren't designed to do that. They are
> designed to keep a certain amount of data closer to the CPU for faster access,
> but never to provide wrong data -- for some definition of "wrong". All multi-
> processor systems need some kind of mechanism to keep caches coherent.
> The solution on Intel x86 is that a CPU writing to a particular cacheline must
> exclusively acquire it. One core or package that writes to a cacheline will
> actively go after the same cacheline in other caches and invalidate them. This
> solution is simple and effective if you have only a handful of caches, but it
> doesn't scale well -- which is why Intel still makes a lot of money from
> Itanium, an architecture whose death has been predicted over and over again.
> It also creates problems like "false sharing" (cf. the IA-32 "Optimisation
> Reference Manual" user/source coding rule 23).
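
That false-sharing pitfall is easy to show in code. A sketch of my own
(the struct names are mine, not from the manual):

    #include <atomic>

    // Both counters share one 64-byte cacheline: each thread's write
    // forces the other core to re-acquire the line exclusively.
    struct SharingLine {
        std::atomic<long> a;  // written by thread 1
        std::atomic<long> b;  // written by thread 2
    };

    // One cacheline per counter: the writes no longer contend.
    struct OnePerLine {
        alignas(64) std::atomic<long> a;
        alignas(64) std::atomic<long> b;
    };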
> The solution on Itanium, which is designed for scaling a lot more, is that the
> common, off-die L3 cache contains a listing of which caches have been
> invalidated by each processor. So when CPU A writes to that cacheline, it will
> notify the L3. When CPU B reads from that cacheline, it will check with the L3
> and find out that its cached data has been invalidated.
> What I don't remember by heart is *how* and *when* those things happen. I know
> they happen.
> If you want to read some more on this subject, take a look at this article:
> It's talking about the Poulson, the codename for the Itanium processor that
> has apparently not been released yet. The current Itanium processor is the
> Tukwila. The x86 family processor that the article talks about is the
> Westmere-EX, the family of Xeon processors comparable to the first generation
> Core-i7. There have been two more generations of Xeon since that article.
So, to summarize what you're saying: real-world hardware is not designed
to let a cache serve stale data indefinitely. Therefore the volatile-bool
scheme for signalling a thread to stop will work in practice (though
volatile is not good enough for full-featured thread synchronization),
because the hardware will cause the updated value of the stop flag to
become visible to the polling thread in a reasonable amount of time. The
polling thread might run a little longer than expected, but it will stop
in a reasonable amount of time (for some reasonable definition of
"reasonable").
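And, for completeness: my understanding is that the C++11 way to get the
same behaviour by language guarantee, rather than by grace of the
hardware, is std::atomic. A minimal sketch, assuming a C++11 compiler:

    #include <atomic>
    #include <thread>

    std::atomic<bool> stopRequested(false);

    void worker()
    {
        // The atomic load is emitted on every iteration, and the store
        // below is guaranteed to become visible to this thread.
        while (!stopRequested.load(std::memory_order_relaxed)) {
            // ... do a unit of work ...
        }
    }

    int main()
    {
        std::thread t(worker);
        stopRequested.store(true, std::memory_order_relaxed);
        t.join();
    }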
> Thiago Macieira - thiago.macieira (AT) intel.com
> Software Architect - Intel Open Source Technology Center
Again, thank you for your explanations.