[Interest] Cross platform accelerated instructions framework

Thu May 14 16:55:35 CEST 2015

On Thursday May 14 2015 14:52:45 Allan Sandfeld Jensen wrote:

> Alternatively use vector intrinsics, but the generic intrinsics are not that 
> powerful.

To put that in perspective:

There's a yuv4xx conversion routine I once used, from FFmpeg I think, which comes with a hand-optimised SSE2 version in addition to the straightforward scalar version (i.e. a couple of plain loops). At some point I noticed that, built with gcc on OS X, the scalar routine ran 10x faster than the SSE2 version, whereas with MSVC the SSE2 version was a lot (but not 10x) faster than the scalar one. That had me stymied until I realised I was building for x86_64 on OS X, which allowed gcc's auto-vectorisation to blow away the hand-optimised version.
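
For illustration, here is a minimal sketch of the kind of plain scalar loop an auto-vectoriser can work with. This is not the FFmpeg routine: the function name, the planar 4:4:4 layout and the widely used integer BT.601 approximation are all my assumptions. Whether a given compiler actually vectorises it depends on the version and flags (with gcc, try -O3, and -fopt-info-vec to see what it reports).

```c
#include <stddef.h>
#include <stdint.h>

/* v is a colour value scaled by 256; clamp it, then drop the fraction.
   Clamping before the shift avoids right-shifting a negative int. */
static inline uint8_t clamp_shift8(int v)
{
    if (v < 0) v = 0;
    if (v > 65535) v = 65535;
    return (uint8_t)(v >> 8);
}

/* Hypothetical scalar converter: planar YUV 4:4:4 in, planar RGB out.
   'restrict' tells the compiler the buffers do not overlap, which is
   usually a precondition for vectorising the loop. */
void yuv444_to_rgb_scalar(const uint8_t *restrict y,
                          const uint8_t *restrict u,
                          const uint8_t *restrict v,
                          uint8_t *restrict r,
                          uint8_t *restrict g,
                          uint8_t *restrict b,
                          size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int c = y[i] - 16;   /* luma, video range */
        int d = u[i] - 128;  /* Cb, centred on zero */
        int e = v[i] - 128;  /* Cr, centred on zero */

        r[i] = clamp_shift8(298 * c + 409 * e + 128);
        g[i] = clamp_shift8(298 * c - 100 * d - 208 * e + 128);
        b[i] = clamp_shift8(298 * c + 516 * d + 128);
    }
}
```

The loop body has no branches that depend on neighbouring pixels and no function calls that cannot be inlined, which is what gives the compiler room to process several pixels per iteration on its own.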

Of course this had little to no influence on overall performance ...

Another thing to take into consideration: on x86, gcc (and presumably clang; dunno about MSVC) uses SIMD instructions for regular floating-point math too, whether there's something to vectorise or not: see the -mfpmath option. I have at times tried to use intrinsics to speed up certain simple calculations. That usually turned out to be counter-productive: I think the overhead of loading the variables into SIMD types (registers) was larger than the gain.
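
To make that concrete, here is a hedged sketch of the sort of "simple calculation" I mean, written once as ordinary C and once with SSE2 intrinsics. The function names are made up for the example, and the intrinsics path is only compiled where __SSE2__ is defined. I have not benchmarked this particular pair; it just shows where the extra set/extract steps come from.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Ordinary C. With SSE floating-point math (see -mfpmath) the compiler
   can already do this multiply and add in SSE registers. */
double scale_offset_scalar(double x, double gain, double offset)
{
    return x * gain + offset;
}

/* Hand-written variant (hypothetical): each scalar is first wrapped in
   a __m128d, the arithmetic uses only the low lane, and the result is
   then extracted again. Those wrap/extract steps are the overhead. */
double scale_offset_sse2(double x, double gain, double offset)
{
#if defined(__SSE2__)
    __m128d vx = _mm_set_sd(x);
    __m128d vg = _mm_set_sd(gain);
    __m128d vo = _mm_set_sd(offset);
    __m128d vr = _mm_add_sd(_mm_mul_sd(vx, vg), vo);
    return _mm_cvtsd_f64(vr);
#else
    /* No SSE2 available: fall back to the plain expression. */
    return x * gain + offset;
#endif
}
```

Both functions compute the same value; the second one merely spells out by hand what the compiler may already be doing for the first, so there is little left to gain unless several values are processed per instruction.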

Take a look at the Eigen3 framework too (eigen.tuxfamily.org), but be aware that ultimately it may be more efficient to use Apple's Accelerate framework on OS X and iOS, and whatever equivalents exist on the other platforms you target.

R.