fma() (and fmaf()) to optimize?

I have code in a hot-spot kernel that looks like this:

d[0] = k1[0] * k2[0] - k1[1] * k2[1];

(where all variables are floats)

I am wondering if there is an opportunity to optimize this at all with fmaf()?

Possible answers:
a) yes, do this:
float tmp = -k1[1] * k2[1];
d[0] = fmaf(k1[0], k2[0], tmp);

b) no, the nvcc compiler will optimize this to fmaf() for you already

c) no, you should be doing something else entirely.

I ask because when I replace my original code with code in (a), I do not see a difference in timing studies.

I realize that caching plays a big part. I don’t see any opportunity to pre-load these operands into shared memory before computation (the values are only used once). However, given the sequential access pattern across threads within the block, it seems possible, even likely, that the operands are already in the L1 cache thanks to coalesced access and cache-line loading, but I don’t know for certain.

Any advice? Are there gains to be made here, or am I barking up the wrong tree? Thanks!

My first piece of advice would be to learn to use the CUDA binary utilities and understand what the compiler is actually doing. FMA contractions are usually useful from a performance perspective, and occasionally useful from an accuracy perspective, and the compiler will aggressively seek them out. It should not normally be necessary to go to great pains to craft FMA operations yourself.
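For instance, you can disassemble the compiled kernel and search the SASS for FFMA (single-precision fused multiply-add) instructions. A sketch, where the file name `kernel.cu` and the `sm_70` target are assumptions you would adapt to your build:

```shell
# Build a cubin for a real architecture, then disassemble it and
# search the machine code for fused multiply-add instructions.
nvcc -arch=sm_70 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin | grep FFMA
```

If FFMA shows up for this expression, the compiler has already performed the contraction and variant (a) buys you nothing.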

If you don’t see a change in performance after refactoring the code, that is normally a solid indication that you are barking up the wrong tree: the limiter to code performance is often somewhere else.

Which brings up the second piece of advice: learn to use the profilers, and develop skills for analysis-driven optimization.

The answer is likely one of (b) or (c).

To first order, the answer is (b). In release builds, the CUDA compiler makes aggressive use of FMA-contraction, so there is usually no need to do this transformation manually. However, since only one of the two products can be merged into an FMA (and you won’t be able to pick which one), the code could fail in rare cases, for example by producing a tiny negative result instead of zero, making a following square root deliver a NaN.

A superior replacement in terms of numerical accuracy and robustness is the piece of code shown below, which, however, requires more operations than the original code. If your code is not compute bound (something the CUDA profiler can help you determine), my recommendation would be to use this code, i.e. pick variant (c). If your code is compute bound, you will have to make a judgement call as to whether you value robustness and accuracy over raw performance.

/* diff_of_products() computes a*b-c*d with a maximum error < 1.5 ulp

   Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
   "Further Analysis of Kahan's Algorithm for the Accurate Computation
   of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284,
   Oct. 2013, pp. 2245-2264
*/
template <typename T>
__device__ T diff_of_products (T a, T b, T c, T d)
{
    T w = d * c;
    T e = fma (-d, c, w);  // exact rounding error of w = d * c
    T f = fma (a, b, -w);  // a * b - w, with a single rounding
    return f + e;
}