I have code in a hot-spot kernel that looks like this:
d = k1 * k2 - k1 * k2
(where all variables are floats)
I am wondering whether there is an opportunity to optimize this with fmaf(). The options I see are:
a) yes, do this:
float tmp = -k1 * k2;
d = fmaf(k1, k2, tmp);
b) no, the nvcc compiler will optimize this to fmaf() for you already
c) no, you should be doing something else entirely.
I ask because when I replace my original code with the code in (a), I see no difference in timing studies.
I realize that caching plays a big part here. The values are only used once, so I don't see an opportunity to pre-load these operands into shared memory before computation. However, since threads within a block access the operands sequentially, the coalesced accesses and cache-line loading make it seem likely that the operands are already resident in L1 cache, though I don't know this for certain.
Any advice? Are there gains to be made here, or am I barking up the wrong tree? Thanks!