I am reading the FP on NVidia GPU whitepaper from the CUDA documentation set, and tried to play with the fma example from it. In particular, I tried to see the difference between non-fused and fused calculation as explained in the text.

For that, I wrote the following code

static float
cpu_fma(float a, float b)
	return (a * a + b);

__global__ void
gpu_fma(float a, float b, float *r)
	*r = fma(a, a, b);

Surprisingly, the results for both routines, as printed by the “%.8g” format, are the same. I looked at the ptx code, and see the strange thing for the gpu_fma:

/*0010*/         MOV R0, c[0x0][0x20];           /* 0x2800400080001de4 */
        /*0018*/         MOV R2, c[0x0][0x28];           /* 0x28004000a0009de4 */
        /*0020*/         MOV R3, c[0x0][0x2c];           /* 0x28004000b000dde4 */
        /*0028*/         FFMA R0, R0, R0, c[0x0][0x24];  /* 0x3000800090001c00 */

Note the arguments to the FFMA instructions; this is with nvcc

Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1

on Fedora 20.

I am complete newbie, any idea what I am doing wrong ?

The contraction of multiply-add sequences into FMA is an important optimization the compiler applies. By default nvcc compiles with full optimization, which is why both of your kernels will contain an FMA at machine code level. You can either compile without optimizations, or you can disable FMA contraction specifically, at any optimization level, by adding the command line argument -fmad=false to the nvcc invocation.

The numerical differences between a multipy-add seqence and FMA will be most apparent when the product and the addend are close in magnitude, but differ in sign. Try something like a = 1.999999f and b = -4.0f.

Not sure what you mean by “strange thing”. FFMA is the single-precision FMA instruction. The c references in the machine code (SASS) are constant memory locations used to pass kernel arguments. In particular c[0x0][0x20] contains ‘a’ and c[0x0][0x24] contains ‘b’, and I am guessing that c[0x0][28] and c[0x0][2C] collectively hold the 64-bit pointer ‘r’, although the code using that data is not shown.

There is only one kernel in the code snippet I pasted above, it is gpu_fma(). The cpu_fma() is executed on host, and since this is SandyBridge, the calculation cannot be fused; I verified the generated asm code just in case.

About the strange thing in the -sass dump. I said that I am newbie, so I misread the ffma instruction, where I thought that the order of operands is src1, …, srcN, dst, similar to Intel AT&T syntax, but it is dst, src1, …, srcN. Thank you for explaining this.