In my code, I was surprised to find that the theoretical number of FLOPs of my algorithm did not match the count reported by nvvp. After a long time sleuthing through the code, I pinpointed one of the lines that nvvp (flops_dp) miscounts.
I am curious to know what is going on behind the scenes for this to happen…
The return statement is counted as flops_dp = 21 double-precision floating-point operations! :-O
I would only expect either 8 or 7 operations: 8 if we count (-, *, -, *, +, *, /, /), or 7 if the first subtraction (get_const() - 1.) is optimized out by the compiler.
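For concreteness, here is a hypothetical kernel with the same visible operation mix; the variable names and the body of get_const() are invented, since the actual return statement isn't reproduced in this post:

```
// Hypothetical stand-in for the constant accessor mentioned above.
__device__ double get_const() { return 3.0; }

// Invented expression with the operation mix described above:
// 2 subtractions, 3 multiplications, 1 addition, 2 divisions = 8 visible DP ops.
__global__ void example_kernel(const double* in, double* out)
{
    double a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = ((get_const() - 1.0) * a - b * c + d * a) / b / 2.0;
}
```

Profiling a kernel like this with nvprof --metrics flop_count_dp (which, as far as I understand, is the counter behind nvvp's flops_dp figure) should reproduce the inflated count, since each division expands into several instructions.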
If I remove the final division, making the return statement …
The GPU hardware has no support for division operations, either integer or floating-point. These operations are implemented by software sequences that are either inline code or called subroutines. In the case of double-precision floating-point division, this is a called subroutine comprising quite a few floating-point instructions, required to produce a correctly rounded quotient. You can see the details by looking at the generated machine code (SASS) with cuobjdump --dump-sass.
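As a minimal illustration, here is a single-division kernel plus the commands to dump its SASS (a sketch; the exact instruction sequence varies with GPU architecture and CUDA version):

```
// div_sass.cu -- one double-precision division, for inspecting the
// software sequence the compiler emits for '/'.
__global__ void div_kernel(const double* x, const double* y, double* q)
{
    q[0] = x[0] / y[0];
}

// Compile to a cubin and dump the machine code (adjust -arch to your GPU):
//   nvcc -arch=sm_70 -cubin div_sass.cu -o div_sass.cubin
//   cuobjdump --dump-sass div_sass.cubin
// The division typically appears as a called subroutine built from a
// reciprocal approximation (MUFU.RCP64H) refined by several DFMA/DMUL/DADD
// instructions, rather than a single divide instruction.
```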
As for the division by 2, I think you will find that it gets optimized into a multiplication by 0.5. This is a safe transformation even under strict IEEE-754 semantics (see also section F.8.2 of the C99 standard, which specifically calls this out as a safe and permissible transformation).
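A quick way to convince yourself the transformation is value-preserving: because 0.5 is a power of two, x / 2.0 and x * 0.5 compute the same exact mathematical value and round it identically for every double. A small host-side check (compiles with nvcc or any C++ compiler):

```
#include <cstdio>
#include <cstdint>
#include <cstring>

// Compare x / 2.0 and x * 0.5 bit-for-bit on a few sample values.
// Both round the same exact mathematical value x/2, so the results are
// identical, including subnormal results.
int main()
{
    const double samples[] = {1.0, 3.141592653589793, 1e-308, 1e308, -7.25, 0.1};
    for (double x : samples) {
        double d = x / 2.0;
        double m = x * 0.5;
        std::uint64_t bd, bm;
        std::memcpy(&bd, &d, sizeof d);
        std::memcpy(&bm, &m, sizeof m);
        std::printf("x = %-22.17g  %s\n", x, bd == bm ? "identical" : "DIFFER");
    }
    return 0;
}
```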
When running a tiny kernel with just a few divisions, my Nsight report shows about 7 FFMA operations per expected division. Each FFMA can be counted as 2 FLOPs, so one division in the code would result in about 14 FLOPs.

When I analyze my larger kernel, though, I match my theoretical FLOP count with the Nsight report if I count 15 FLOPs per division. This is all very empirical, and it would be nice if the profiling tools reported what constitutes a division, especially since it is not always obvious: there can be faster and slower divisions depending on the accuracy required.
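One way to see the breakdown yourself is to profile an isolated-division kernel and read the per-instruction-type counters that Nsight Compute exposes (a sketch; the metric names below are the ones the nvprof-to-Nsight-Compute mapping uses for flop_count_dp, so verify them against your tool version):

```
// one_div.cu -- one DP division per thread, to isolate its FLOP cost.
__global__ void one_div(const double* x, const double* y, double* q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = x[i] / y[i];
}

// Nsight Compute command-line sketch (./your_app is a placeholder):
//   ncu --metrics smsp__sass_thread_inst_executed_op_dadd_pred_on.sum,\
//       smsp__sass_thread_inst_executed_op_dmul_pred_on.sum,\
//       smsp__sass_thread_inst_executed_op_dfma_pred_on.sum ./your_app
//
// Total DP FLOPs = dadd + dmul + 2 * dfma; dividing each counter by the
// number of executed divisions shows what one division is built from.
```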