This question was posted here before, but there was never a clear answer to it.
Let’s assume I have a CUDA kernel and I want to count the FLOPs of its inner loop. I am looking at it at the PTX level, because some optimizations have already been applied there.
Obviously, add, sub and mul each count as 1 FLOP. But what about:
fma <- should this be 1 or 2 FLOPs?
rsqrt.approx <- going by section 5.4.1 of the CUDA Programming Guide, one might assume that it takes 8 times longer, so should I count it as 8?
div.approx <- same question as for rsqrt: 1 or 8?
There may be further optimizations after the PTX stage, but I have no way to decompile the result. Looking at the CUDA source code itself is pointless, because many optimizations have already been applied by the time the PTX is generated.
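
To make the counting itself concrete, here is a minimal Python sketch that tallies floating-point instructions in a PTX listing and applies a weight table. The weights are only assumptions: fma = 2 follows the common convention of counting the multiply and the add separately, and rsqrt/div default to 1 but can be set to 8 to try the throughput-based reading. The names FLOP_WEIGHTS and count_flops are made up for this illustration, and the regex only handles simple one-instruction-per-line PTX.

```python
import re
from collections import Counter

# Assumed weights, not an official rule: change them to match the
# convention you want to report under.
FLOP_WEIGHTS = {
    "add": 1,
    "sub": 1,
    "mul": 1,
    "fma": 2,    # assumption: one multiply plus one add
    "rsqrt": 1,  # assumption: one operation, regardless of throughput
    "div": 1,    # assumption: one operation, regardless of throughput
}

# Optional predicate (e.g. "@%p1"), then opcode, optional modifiers,
# and a floating-point type suffix, e.g. "fma.rn.f32" or "add.f32".
INSTR = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z]+)((?:\.\w+)*)\.(f32|f64)\s")


def count_flops(ptx_text, weights=FLOP_WEIGHTS):
    """Return (per-opcode instruction counts, weighted FLOP total)."""
    counts = Counter()
    for line in ptx_text.splitlines():
        m = INSTR.match(line)
        # Integer ops (add.s32, mul.wide.s32) and loads/stores are skipped:
        # they either lack the float suffix or are not in the weight table.
        if m and m.group(1) in weights:
            counts[m.group(1)] += 1
    total = sum(weights[op] * n for op, n in counts.items())
    return counts, total


if __name__ == "__main__":
    # Hypothetical inner-loop body, written by hand for this example.
    sample = """
        ld.global.f32    %f1, [%rd1];
        mul.wide.s32     %rd2, %r1, 4;
        fma.rn.f32       %f3, %f1, %f2, %f3;
        rsqrt.approx.f32 %f4, %f3;
        div.approx.f32   %f5, %f4, %f1;
        add.f32          %f6, %f5, %f1;
    """
    counts, total = count_flops(sample)
    print(dict(counts), total)
```

Passing a different table, for example `{**FLOP_WEIGHTS, "rsqrt": 8, "div": 8}`, shows how much the total depends on the chosen convention. This counts instructions per loop body only; multiplying by the trip count is left to the reader.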