[Interest] CppCon Just Ended (next year will be 20-25 Sep 2015)

charleyb123 . charleyb123 at gmail.com
Mon Sep 15 20:55:56 CEST 2014


<snip, CppCon 2014>

> > *- Facebook (Andrei Alexandrescu) "cheats" by "stealing bits" in
> their implementation of a "shared-pointer", using the stolen bits for
> the "object-count", with the understanding that the count rarely goes
> above "4".  Correctness is sacrificed for a 0.4% performance gain,
> which is a "huge win" for server load and electricity (saves $millions
> a month).  "Leak, Log, and Leave" is what they do for the uncommon
> correctness failure-case.
>

Oliver commented:

> So my first thought is that, since they observed that "the reference
> count rarely goes above 4", they restrict their counters to N bits
> only, where N < 8 (one "Byte").
>
> So let's say they'd use only 4 bits for their reference counters in their
> "smart pointer" implementation. If you placed all the reference counters
> into an "external array structure", "array of char" (which would - somehow
> - be "globally accessible" without storing an extra pointer to it in each
> smart pointer instance), then you could place two reference counters into a
> char, hence reducing the memory usage by factor 2 (compared to reference
> counters implemented as "char" (byte) only (*)).
>
> So you save a little (how much?) memory. Let's be generous: let's say a
> typical application of theirs would consume 20% less RAM with such "memory
> optimised counters". Saving a few GB of RAM on their servers could probably
> indeed save you some bucks on the electricity bill.
>
>
> But of course you'd need to keep track of reference counters which go
> to 0 (-> "re-use") - or you just let the "array of compressed reference
> counters" grow.
>
> But then the biggest matter is of course leaked memory: once such a "4
> bit reference counter" gets "saturated", you have to "mark" that smart
> pointer as "is leaking": there is no way the involved object could ever be
> safely deleted. Depending on the size of the involved objects I don't think
> you'd have a net gain in memory footprint reduction, especially on server
> applications (unless you restart the server instance for "extensive garbage
> collection" every 24 hours or so ;)). Besides, you'd need extra state
> variables to mark the smart pointer as "leaking" (or you'd say that
> MAX_COUNT of the counter implies "leaking").
>
>
> So I don't think by "stealing bits" they want to save on memory usage, do
> they?
>
> Hence the big question: how then? Are there arithmetic operations such
> as "Increment/Decrement" which are faster on "a couple of bits" than,
> say, an Integer (32 or 64 bit)? Or some other "bit fiddling" that one
> can exploit on such small values, that I cannot think of right now?
>
> ("Parallel incrementing" two values packed into a single value comes to my
> mind - but such SIMD instructions exist for Integer (Float? Double?), and
> not for char values, right? And even so: how would that help if you wanted
> to increment ONE counter in a given smart pointer instance?)
>
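Oliver's packed-counter idea could be sketched roughly like this -- purely
speculative on my part, with made-up names, and certainly not anyone's
actual production code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: two 4-bit reference counters packed per byte, kept in one
// global array. A counter that reaches 15 "saturates": the object is
// marked as leaking and is never freed ("Leak, Log, and Leave").
class PackedCounters {
public:
    static constexpr uint8_t kMax = 0xF;  // saturation => "leaking"

    explicit PackedCounters(std::size_t n) : bytes_((n + 1) / 2, 0) {}

    uint8_t get(std::size_t i) const {
        uint8_t b = bytes_[i / 2];
        return (i % 2) ? (b >> 4) : (b & 0x0F);
    }

    // Increment counter i; sticks at kMax once saturated.
    void incref(std::size_t i) {
        uint8_t c = get(i);
        if (c < kMax) set(i, static_cast<uint8_t>(c + 1));
    }

    // Decrement counter i; returns true when the count hits zero
    // (object may be deleted). A saturated counter is never decremented.
    bool decref(std::size_t i) {
        uint8_t c = get(i);
        if (c == kMax) return false;  // leaking: leave it
        assert(c > 0 && "decref on zero count");
        set(i, static_cast<uint8_t>(c - 1));
        return c == 1;
    }

private:
    void set(std::size_t i, uint8_t v) {
        uint8_t& b = bytes_[i / 2];
        b = (i % 2) ? static_cast<uint8_t>((b & 0x0F) | (v << 4))
                    : static_cast<uint8_t>((b & 0xF0) | (v & 0x0F));
    }
    std::vector<uint8_t> bytes_;  // two counters per byte
};
```

As Oliver notes, this halves the counter storage versus one byte per
counter, at the cost of leaking any object whose count ever reaches 15.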

My understanding is that Facebook's internal "shared_ptr<>" uses a "Tagged
Pointer" implementation, where it is merely a single "T*" address, and not
a (T*,int) for the address-and-count.  See:
http://en.wikipedia.org/wiki/Tagged_pointer

So, although the pointer is 64 bits wide, current processors actually
only use 48 (or 56) of those bits for addressing, so there is apparently
a lot of leftover room for burgling:
https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses
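The bit-stealing might look roughly like the following -- my own sketch
of the general tagged-pointer technique, not Facebook's actual layout
(the class and its fields are invented):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: on x86-64 only the low 48 bits of a pointer are significant,
// so a small reference count can live in the otherwise-unused top 16
// bits, keeping pointer-and-count together in a single 64-bit word.
// (Hypothetical layout; the real implementation may differ.)
template <typename T>
class TaggedPtr {
public:
    static constexpr uint64_t kAddrMask = (1ULL << 48) - 1;

    explicit TaggedPtr(T* p)
        : bits_(reinterpret_cast<uint64_t>(p) & kAddrMask) {}

    T* get() const {
        // Restore canonical form: sign-extend bit 47 over the top 16 bits.
        int64_t s = static_cast<int64_t>(bits_ << 16) >> 16;
        return reinterpret_cast<T*>(s);
    }

    uint16_t count() const { return static_cast<uint16_t>(bits_ >> 48); }

    void incref() { bits_ += (1ULL << 48); }  // bump count in the high bits
    void decref() { bits_ -= (1ULL << 48); }

private:
    uint64_t bits_;  // layout: [count:16][address:48]
};

// The whole pointer-plus-count fits in one machine word:
static_assert(sizeof(TaggedPtr<int>) == sizeof(int*),
              "no extra word needed for the count");
```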

In this case, they get HUGE savings because the entire shared_ptr<> now
fits in 64 bits - a single machine word, and a single slot in the L1
cache.  Smaller executable size also, but the L1 cache behaviour is
apparently "the biggie".  A shared_ptr<> that carries a 64-bit pointer
plus a companion "int" for the "ref-count" no longer fits in a single
64-bit word, so they removed the separate count.

Other talks also suggested that you can gain a similar performance
improvement by pushing the "most-frequently-accessed" members of your
object into the first cache line (typically the first 64 bytes, because
that's the first part loaded into the L1 cache); if you don't need the
other members, you similarly "win".
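For illustration, that layout advice might look like this (a
hypothetical struct with made-up field names; the 64-byte cache-line
size is an assumption that holds on typical x86 parts):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: keep the most-frequently-accessed members at the front of the
// struct so they land in the first cache line loaded.
struct Order {
    // Hot: touched on every iteration of the fast path.
    uint64_t id;
    double   price;
    uint32_t quantity;
    // Cold: only read on the slow path (logging, auditing).
    char     note[128];
    uint64_t created_at;
};

// All hot members fit within the first 64 bytes (one cache line):
static_assert(offsetof(Order, quantity) + sizeof(uint32_t) <= 64,
              "hot members fit in the first cache line");
```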

But, your speculation is very interesting for what the Game industry
seems to be doing -- they love these external (raw) vectors-of-data, and
then write external functions that rip through that data very fast, or
in parallel, or push the whole thing to the GPU, where it resides.
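That "external vectors-of-data" style might be sketched like this (my
own illustration of the general structure-of-arrays idea; the names are
invented):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: instead of an array of objects, keep each field in its own
// contiguous vector and rip through it in one tight loop. Contiguous,
// homogeneous data is cache-friendly and easy to parallelise or hand
// to a GPU.
struct Particles {
    std::vector<float> x, y;    // positions
    std::vector<float> vx, vy;  // velocities
};

// One tight pass over contiguous data; no pointer chasing.
void integrate(Particles& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```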

This is my current understanding ... the videos will be posted in the next
couple weeks, and we can correct my (perhaps incorrect) description.

--charley