fmuladd in C/SSE for Intel x86 24Gflops/core

I ported the NVIDIA fmuladd code to x86 SSE using Intel’s “xmmintrin.h” and the good vector support provided by GCC 4.0.1 (Apple Darwin version).

The source code is provided; compile it with “gcc -O3 sse4l.c”

Notice that I increased the iteration count by a 10X factor to enable use of the time utility (so it stays roughly portable to any x86 SSE platform).
The resulting time should therefore be divided by 10 when comparing it to the fmuladd measurements.

Performance is about 18 GFlops on my Core 2 Duo 2.4 GHz, far from the 45 GFlops reported for the GeForce 8800M GT (mobile and slow).
A Mac Pro (8x3.2 GHz Xeon) reaches around 12 GFlops/core, for a total of 96 GFlops.
sse4l.c.gz (1.06 KB)

Notice that the Core 2 Duo Penryn (and Penryn Xeon) uses macro-fusion on 32-bit platforms to fuse the floating-point multiplication and the subsequent FP addition into one internal FP muladd. Combined with SSE (4 parallel FP operations) and the use of 4 interleaved sets of data (the 8 xmm registers accessible in 32-bit mode), this lets the CPU output 8 FP operations (4 muladds = 4 mul + 4 add) per cycle!

Impressive work from Intel!

Sorry, but where is the GPU version? I also want to compile a GPU version to test.
Thanks in advance

This is the NVIDIA source code that was linked from the forum; I repost it here for comparison purposes, with the same 10X increase in fmuladd operations to match the C/SSE source code.
fmuladd.tar.gz (2.66 KB)

What a shame, the CPU/SSE source file does not compile within Visual Studio. :(

It keeps bitching about the “*” operator being unsuitable for “struct” types.

Same problem with Intel C++ 10.1, with slightly different error message:

1>sse4l.c(54): error: expression must have arithmetic type
1>       a = a * b + b; 
1>           ^

Just do this:

#define FMADDSSE32(a, b, c, d, e, f, g, h) \
a = _mm_add_ss(_mm_mul_ss(a, b), b);\
c = _mm_add_ss(_mm_mul_ss(c, d), d);\
e = _mm_add_ss(_mm_mul_ss(e, f), f);\
g = _mm_add_ss(_mm_mul_ss(g, h), h);\
b = _mm_add_ss(_mm_mul_ss(a, b), a);\
d = _mm_add_ss(_mm_mul_ss(c, d), c);\
f = _mm_add_ss(_mm_mul_ss(e, f), e);\
h = _mm_add_ss(_mm_mul_ss(g, h), g)

not a problem at all :-)

I’ve faced another problem: the executable does not work on some Intel x86 machines … however, it was enough to collect the statistics.

Eheh, not so easy unfortunately:

1>sse4l.c(78): error: expression must have arithmetic or pointer type
1>     b = a+b+c+d+e+f+g+h;
1>         ^


The code is meant to be portable using GCC 4.0.1 or later, not to be compatible with proprietary closed-source compilers ;-)

BTW, if you compare the output of gcc -O3 -S for this code against “_mm_add_ss(_mm_mul_ss(a, b), a);” for example, you will see it’s TOTALLY different, in that you end up with a lot of memory<->xmm register shuffling.

Another concern is that using _mm_add_ss or _mm_mul_ss requires knowledge of the SSE instruction set (and the xmmintrin.h library), whereas just using __m128 with native operators leads to code that is nearly free of architecture-specific instructions/function calls/macros.

Please use GCC, it’s free and really efficient; using my SSE code with GCC instead of the _mm_add_ps approach suggested above, you may end up with a really different performance level.

Well, the company I work for uses Visual Studio; so it’s easier for me to use the same tools in my spare time too.
Of course I use gcc for my little Linux projects, but 90% of my work (including CUDA-related stuff) is within Windows and Visual Studio is just about the standard for ordinary Windows programming.

Well thank you anyway, it’s good to know that SSE can put out some impressive performance if done right! :thumbup:


You should probably use _mm_add_ps and _mm_mul_ps, since the *_ss versions only act on the lowest of the four single-precision floats in the 128-bit SSE register.

See for the full list of functions understood by MSVC.

I rewrote part of the fmuladd CPU code to report the following information:

  • CPU core performance using float (time + gflops/s)
    (in fact GCC may use SSE single-float code, surprisingly)
  • SSE core performance using float (time + gflops/s)

Due to incompatibilities between architectures, and some systems that don’t differentiate Hyper-Threading from actual physical cores, I just focused on a single-core micro-benchmark showing the sustained GFlops on Intel CPU architectures using plain float code or SSE.

Please compile it using GCC 4.0 or later.
gcc -O3 -o fmuladd-cpu fmuladd-cpu.c

Let me know if you have problems with GCC, or if you see some weird performance numbers. (2 KB)