I ported the nVidia fmuladd code to x86 SSE using Intel’s “xmmintrin.h” intrinsics, which GCC 4.0.1 (Apple Darwin version) supports well.
The source code is attached; compile it with “gcc -O3 sse4l.c”.
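For readers who do not want to unpack the attachment, here is a minimal sketch of what a multiply-add loop written with “xmmintrin.h” can look like. It is not the contents of sse4l.c: the function names sse_muladd_chain and flops_per_call, the parameters and the single dependent chain are my own assumptions for illustration. A real benchmark would probably keep several independent accumulators in flight to hide instruction latency.

```c
#include <xmmintrin.h>

/* Hypothetical kernel (NOT the attached sse4l.c): applies x = x * mul + add
 * to four packed floats, iters times, then returns the sum of the lanes so
 * the compiler cannot discard the loop. */
static float sse_muladd_chain(float seed, float mul, float add, long iters)
{
    __m128 x = _mm_set1_ps(seed);        /* broadcast seed to all 4 lanes */
    const __m128 m = _mm_set1_ps(mul);
    const __m128 a = _mm_set1_ps(add);
    float out[4];
    long i;

    for (i = 0; i < iters; ++i) {
        /* SSE has no fused multiply-add: one packed multiply, one packed add */
        x = _mm_add_ps(_mm_mul_ps(x, m), a);
    }

    _mm_storeu_ps(out, x);               /* unaligned store of the 4 lanes */
    return out[0] + out[1] + out[2] + out[3];
}

/* Each iteration performs 4 multiplies and 4 adds, i.e. 8 flops. */
static double flops_per_call(long iters)
{
    return 8.0 * (double)iters;
}
```

Calling sse_muladd_chain from main with a large iters value and running the binary under the time utility gives the elapsed time; flops_per_call gives the matching operation count.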
Note that I increased the iteration count by a factor of 10 so that the run lasts long enough to be measured with the time utility (which keeps the test portable to almost any x86 SSE platform).
The resulting time should therefore be divided by 10 before comparing it to the fmuladd measurements.
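The conversion from a measured time to a GFlops figure can be sketched as follows. The helper names normalized_seconds and gflops are hypothetical, and the flop count has to come from whatever the benchmark loop actually executes; nothing here is taken from sse4l.c.

```c
/* Hypothetical helpers for turning a wall-clock measurement into GFlops. */

/* Undo the 10X iteration increase: the time reported by the time utility
 * is divided by the factor the iteration count was multiplied by. */
static double normalized_seconds(double measured_seconds, double iteration_factor)
{
    return measured_seconds / iteration_factor;
}

/* GFlops = floating-point operations / seconds / 1e9.
 * total_flops must match the amount of work done in `seconds`. */
static double gflops(double total_flops, double seconds)
{
    return total_flops / seconds / 1e9;
}
```

Pass the flop count of the original (1X) workload together with the normalized time, or the 10X flop count together with the raw time; mixing the two is off by a factor of 10.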
Performance is about 18 GFlops on my 2.4 GHz Core 2 Duo, far from the 45 GFlops reported for the GeForce 800M GT (a mobile, slow part).
A Mac Pro (8 x 3.2 GHz Xeon) reaches around 12 GFlops per core, for a total of 96 GFlops.
sse4l.c.gz (1.06 KB)