[Interest] CppCon Just Ended (next year will be 20-25 Sep 2015)

Al-Khanji Louai Louai.Al-Khanji at digia.com
Tue Sep 16 09:18:19 CEST 2014


> But, your speculation is very interesting for what the Game industry seems to be doing -- they love these external (raw) vectors-of-data, and then externally write functions that rip-through that data very fast, or in parallel, or push the whole thing to the GPU where it resides.

The important part indeed is to store a vector of objects, not a vector of pointers to objects. This is an idea that has really been pushed quite hard for a while now.

For example, check out this blog post: http://bannalia.blogspot.fi/2014/05/fast-polymorphic-collections.html Note the graphs at the end: the size of the container does not affect per-element iteration time once you stop storing pointers to objects.
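
Roughly, the idea boils down to something like this (just a toy sketch of by-value storage, not the blog's actual segmented container):

    #include <cstdio>
    #include <vector>

    struct Particle { float x = 0, vx = 1; void update() { x += vx; } };
    struct Rocket   { float y = 0, vy = 2; void update() { y += vy; } };

    int main()
    {
        // One contiguous vector per concrete type, objects stored by value.
        std::vector<Particle> particles(1000);
        std::vector<Rocket>   rockets(1000);

        // Iteration walks contiguous memory: no per-element heap allocation,
        // no pointer chasing, and the calls can be inlined.
        for (Particle &p : particles) p.update();
        for (Rocket   &r : rockets)   r.update();

        std::printf("%f %f\n", particles[0].x, rockets[0].y);
    }

With a std::vector<Base*> instead, every element is a separate heap allocation, so each step of the loop is a potential cache miss; with by-value storage the prefetcher can simply stream through the buffer.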

It's the same thing Herb Sutter talks about here: http://channel9.msdn.com/Events/Build/2014/2-661 (starting at around 23-24 minutes in). The talk is geared very much towards C++ newbies, but the numbers that he presents are quite interesting indeed.

-- Louai

________________________________________
From: interest-bounces+louai.al-khanji=digia.com at qt-project.org [interest-bounces+louai.al-khanji=digia.com at qt-project.org] on behalf of charleyb123 . [charleyb123 at gmail.com]
Sent: Monday, September 15, 2014 9:55 PM
To: Till Oliver Knoll
Cc: Qt Project
Subject: Re: [Interest] CppCon Just Ended (next year will be 20-25 Sep 2015)

<snip, CppCon 2014>
> *- Facebook (Andrei Alexandrescu) "cheats" with "stealing bits" from their implementation of a "shared-pointer" to "steal-bits" for the "object-count", with the understanding that the count rarely goes above "4".  Correctness is sacrificed to increase performance 0.4%, which is a "huge win" for server load and electricity (saves $millions a month).  "Leak, Log, and Leave" is what they do for the uncommon correctness failure-case.

Oliver commented:

So my first thought is that, since they observed that "the reference count rarely goes above 4", they restrict their counters to N bits only, where N < 8 (a byte).

So let's say they'd use only 4 bits for their reference counters in their "smart pointer" implementation. If you placed all the reference counters into an "external array structure", an "array of char" (which would - somehow - be "globally accessible" without storing an extra pointer to it in each smart pointer instance), then you could place two reference counters into each char, hence reducing the memory usage by a factor of 2 (compared to reference counters implemented as "char" (byte) only (*)).
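
Purely hypothetically, such a packed table might look something like this (two 4-bit counters per byte, indexed by some imaginary counter id):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical external table of 4-bit reference counters,
    // two counters packed into each byte. The "id" is whatever index
    // the (imaginary) smart pointer instance was assigned.
    std::vector<std::uint8_t> counters;

    std::uint8_t get_count(std::size_t id)
    {
        std::uint8_t byte = counters[id / 2];
        return (id % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    }

    void set_count(std::size_t id, std::uint8_t value)   // value must be 0..15
    {
        std::uint8_t &byte = counters[id / 2];
        byte = (id % 2 == 0) ? ((byte & 0xF0) | (value & 0x0F))
                             : ((byte & 0x0F) | (value << 4));
    }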

So you save a little (how much?) memory. Let's be generous: let's say a typical application of theirs would consume 20% less RAM with such "memory optimised counters". Saving a few GB of RAM on their servers could indeed save them some bucks on the electricity bill.


But of course you'd need to keep track of reference counters which go to 0 (-> "re-use") - or you just let the "array of compressed reference counters" grow.

But then the biggest issue is of course leaked memory: once such a "4 bit reference counter" gets "saturated", you have to "mark" that smart pointer as "is leaking": there is no way the involved object could ever be safely deleted. Depending on the size of the involved objects I don't think you'd have a net gain in memory footprint reduction, especially in server applications (unless you restart the server instance for "extensive garbage collection" every 24 hours or so ;)). Besides, you'd need extra state variables to mark the smart pointer as "leaking" (or you'd say that MAX_COUNT of the counter implies "leaking").
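
The "MAX_COUNT implies leaking" variant could be as simple as a saturating counter (again, just my own sketch of the logic, nothing real):

    #include <cstdint>

    // Sketch only: a saturating 4-bit count where the maximum value doubles
    // as the "this one is leaking" marker -- no separate state variable needed.
    constexpr std::uint8_t kLeaking = 15;

    std::uint8_t increment(std::uint8_t c)
    {
        return c == kLeaking ? kLeaking : std::uint8_t(c + 1);  // stick at the top
    }

    std::uint8_t decrement(std::uint8_t c)   // a result of 0 means "may be freed"
    {
        return c == kLeaking ? kLeaking : std::uint8_t(c - 1);  // leak, log and leave
    }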


So I don't think by "stealing bits" they want to save on memory usage, do they?

Hence the big question: how then? Are there arithmetic operations such as "Increment/Decrement" which are faster on "a couple of bits" than on, say, an integer (32 or 64 bit)? Other "bit fiddling" that one can exploit on such small values that I cannot think of right now?

("Parallel incrementing" two values packed into a single value comes to my mind - but such SIMD instructions exist for Integer (Float? Double?), and not for char values, right? And even so: how would that help if you wanted to increment ONE counter in a given smart pointer instance?)

My understanding is that Facebook's internal "shared_ptr<>" uses a "Tagged Pointer" implementation, where it is merely a single "T*" address, and not a (T*,int) for the address-and-count.  See:  http://en.wikipedia.org/wiki/Tagged_pointer

So, although the pointer is 64 bits wide, current processors actually only use 48 or 56 of those bits for addressing, so there is apparently a lot of leftover room for burgling:  https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses

In this case, they get HUGE savings because the entire shared_ptr<> now fits in 64 bits, so the whole thing fits in the L1 cache as a single word.  Smaller executable size also, but avoiding the extra L1 cache misses is apparently "the biggie".  A shared_ptr<> that carries a 64-bit pointer plus a companion "int" for the "ref-count" no longer fits in those 64 bits, so they removed the separate count.
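
I have no idea what their actual class looks like, but the generic "steal the spare bits" trick could be sketched roughly like this - assuming user-space addresses fit in the lower 48 bits, and keeping the count in the otherwise unused top bits:

    #include <cstdint>

    // Sketch only (not Facebook's code): pack a small reference count into
    // the top 16 bits of a 64-bit pointer. Assumes the address itself fits
    // in the lower 48 bits, leaving the top bits free to "burgle".
    template <typename T>
    class counted_ptr
    {
        std::uint64_t bits_ = 0;

        static constexpr std::uint64_t kAddrMask = 0x0000FFFFFFFFFFFFull;

    public:
        explicit counted_ptr(T *p, std::uint16_t count = 1)
            : bits_((reinterpret_cast<std::uint64_t>(p) & kAddrMask)
                    | (std::uint64_t(count) << 48)) {}

        T *get() const
        {
            return reinterpret_cast<T *>(bits_ & kAddrMask);
        }

        std::uint16_t count() const
        {
            return static_cast<std::uint16_t>(bits_ >> 48);
        }

        void set_count(std::uint16_t count)
        {
            bits_ = (bits_ & kAddrMask) | (std::uint64_t(count) << 48);
        }
    };

sizeof(counted_ptr<T>) is 8 bytes - one register, the same as a raw pointer - which seems to be the whole point.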

Other talks also suggested that you can similarly gain performance if you push the "most-frequently-accessed-members" of your object into the first 64 bytes (because that's the first cache line loaded into L1), and if you don't need the other members, you similarly "win".
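
As a toy illustration of that member-ordering point (my own example, not from any of the talks): put the hot members at the front of the struct so they land in the first cache line, and push the cold stuff to the back:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hot members first: the per-frame loop only touches these, and they
    // land in the object's first cache line.
    struct Entity
    {
        float x, y, z;           // position, read every frame
        float vx, vy, vz;        // velocity, read every frame
        std::uint32_t flags;     // checked every frame

        // Cold members last: only touched in tooling / rare paths.
        std::string debug_name;
        std::vector<int> edit_history;
    };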

But, your speculation is very interesting for what the Game industry seems to be doing -- they love these external (raw) vectors-of-data, and then externally write functions that rip-through that data very fast, or in parallel, or push the whole thing to the GPU where it resides.

This is my current understanding ... the videos will be posted in the next couple weeks, and we can correct my (perhaps incorrect) description.

--charley
