I see a difference between the FLOP efficiency reported by nvprof and the value I calculate by hand.
On a GTX 1080 Ti, nvprof reports a FLOP efficiency of 16.2% for my kernel.
From the device's spec, the peak single-precision throughput in GFLOP/s is
SM_COUNT x CUDA_CORE_PER_SM x CLOCK x 2
= 28 x 128 x 1.683 (GHz) x 2
= 12,064 GFLOP/s
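To make the peak figure easy to re-check, here is a minimal Python sketch of the formula above. The helper name `peak_gflops` and the constant names are mine, and the device figures are simply the ones quoted in this post, not values queried from the GPU.

```python
# Device figures as quoted in the post (GTX 1080 Ti); assumed, not read from the GPU.
SM_COUNT = 28
CUDA_CORES_PER_SM = 128
CLOCK_GHZ = 1.683
FLOPS_PER_CORE_PER_CYCLE = 2  # one fused multiply-add is counted as 2 FLOPs


def peak_gflops(sm_count, cores_per_sm, clock_ghz, flops_per_cycle=2):
    """Theoretical peak throughput in GFLOP/s (hypothetical helper)."""
    return sm_count * cores_per_sm * clock_ghz * flops_per_cycle


peak = peak_gflops(SM_COUNT, CUDA_CORES_PER_SM, CLOCK_GHZ, FLOPS_PER_CORE_PER_CYCLE)
print(f"Peak: {peak:.1f} GFLOP/s")
```

Changing `CLOCK_GHZ` shows how strongly the peak depends on which clock value is plugged in.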
Now, nvprof reports flop_count_sp = 1,796,739,259 and a kernel runtime of 0.8 ms.
So, on paper, the achieved throughput is
1,796,739,259 / (0.8 * 10^-3) = 2,245,924,073,750 FLOP/s ≈ 2,246 GFLOP/s
and the efficiency on paper is
2,246 / 12,064 ≈ 0.186
which means 18.6%.
So, nvprof says the FLOP efficiency is 16.2% while my calculation shows it is 18.6%.
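The hand calculation can be reproduced with a short Python sketch. All inputs are the numbers quoted in this post (the 0.8 ms runtime is the rounded figure), and the variable names are mine. The last line is only an assumption-laden probe: it computes the peak that would make nvprof's 16.2% consistent with the same FLOP count and runtime, which can then be compared against the 12,064 GFLOP/s used above.

```python
# Inputs as quoted in the post; the runtime is a rounded value.
FLOP_COUNT_SP = 1_796_739_259   # nvprof flop_count_sp
KERNEL_TIME_S = 0.8e-3          # 0.8 ms
PEAK_GFLOPS = 12_064            # peak computed by hand in the post
NVPROF_EFFICIENCY = 0.162       # efficiency reported by nvprof

# Achieved throughput in GFLOP/s.
achieved_gflops = FLOP_COUNT_SP / KERNEL_TIME_S / 1e9

# Efficiency relative to the hand-computed peak.
efficiency = achieved_gflops / PEAK_GFLOPS

# Peak that nvprof's figure would imply, ASSUMING it uses the same
# FLOP count and runtime as above (not confirmed by the post).
implied_peak_gflops = achieved_gflops / NVPROF_EFFICIENCY

print(f"Achieved:     {achieved_gflops:.1f} GFLOP/s")
print(f"Efficiency:   {efficiency:.1%}")
print(f"Implied peak: {implied_peak_gflops:.0f} GFLOP/s")
```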
Is this difference small enough to ignore, or is something missing from my calculation?
Any ideas?