# Actual FLOP count versus flops_dp metric

I was surprised to find that the theoretical number of FLOPs of my algorithm did not match the count reported by nvvp (the flops_dp metric). After a long time sleuthing through the code, I pinpointed one of the lines that nvvp appears to miscount.

I am curious to know what is going on behind the scenes for this to happen…

These are the relevant lines of code:

```
#define CON 3.1415967

__device__ double get_const() {
    return CON;
}

__device__ double func(double *Z) {
    double r1, r2, r3, r4;
    // The array subscripts were lost in the post; Z[0]..Z[3] is an assumed reconstruction.
    r1 = Z[0];
    r2 = Z[1];
    r3 = Z[2];
    r4 = Z[3];

    return (get_const() - 1.) * (r4 - (r2*r2 + r3*r3) / 2. / r1);
}
```

The return statement accounts for flops_dp = 21 double-precision floating-point operations! :-O

I would only expect either 8 or 7 operations. The former if we count (-, *, -, *, +, *, /, /) = 8 operations. The latter if the first subtraction (get_const() - 1.) is being optimized out by the compiler.

If I remove the final division, making the return statement

```
return (get_const() - 1.) * (r4 - (r2*r2 + r3*r3) / 2.);
```

I only get 6 operations which is reasonable, and what I’d expect if the first subtraction is optimized out! Does anyone have an idea what is going on?

The GPU hardware has no support for division operations, neither integer nor floating-point. These operations are implemented by software sequences that are either inline code or called subroutines. In the case of double-precision floating-point division this is a called subroutine comprising quite a few floating-point instructions, required to produce a correctly rounded quotient. You can see the details by looking at the generated machine code (SASS) with cuobjdump --dump-sass.
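One way to see this for yourself is to isolate the division in a tiny kernel and compare it with a plain multiplication. The following is only a minimal sketch: the kernel names `div_kernel` and `mul_kernel` and the host helper `run_pair` are made up for illustration and are not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernels, one thread each.
// The double-precision division expands into a software sequence.
__global__ void div_kernel(const double *in, double *out) {
    out[0] = in[0] / in[1];
}

// The multiplication maps directly onto a hardware instruction.
__global__ void mul_kernel(const double *in, double *out) {
    out[0] = in[0] * in[1];
}

// Hypothetical host helper: runs one of the two kernels on a single operand pair.
double run_pair(bool use_division, double a, double b) {
    double h_in[2] = {a, b};
    double h_out = 0.0;
    double *d_in = nullptr;
    double *d_out = nullptr;

    cudaMalloc(&d_in, 2 * sizeof(double));
    cudaMalloc(&d_out, sizeof(double));
    cudaMemcpy(d_in, h_in, 2 * sizeof(double), cudaMemcpyHostToDevice);

    if (use_division) {
        div_kernel<<<1, 1>>>(d_in, d_out);
    } else {
        mul_kernel<<<1, 1>>>(d_in, d_out);
    }

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main() {
    printf("3 / 4 = %g\n", run_pair(true, 3.0, 4.0));
    printf("3 * 4 = %g\n", run_pair(false, 3.0, 4.0));
    return 0;
}
```

If you build this with nvcc and run cuobjdump --dump-sass on the resulting executable, the SASS for `div_kernel` should be noticeably longer than the SASS for `mul_kernel`. Profiling the two kernels separately should likewise show a flops_dp count well above 1 for the division. The exact instruction sequence and count depend on the GPU architecture and the CUDA version.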

As for the division by 2, I think you will find that it gets optimized into a multiplication with 0.5. This is a safe transformation even under strict IEEE-754 semantics (see also section F.8.2. of the C99 standard which specifically calls this out as a safe and permissible transformation).
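To illustrate why the division by 2 can be replaced while the division by r1 cannot, here is a small sketch that compares division with multiplication by the reciprocal for the constants 2 and 3. The kernel `compare`, the struct `Mismatches`, the helper `count_mismatches`, and the choice of inputs 1, 2, ..., n are all assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Counts inputs for which division and reciprocal multiplication disagree.
__global__ void compare(const double *x, int n, int *by_two, int *by_three) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // 0.5 is exactly representable, so both forms round the same exact value.
    if (x[i] / 2.0 != x[i] * 0.5) atomicAdd(by_two, 1);
    // 1.0 / 3.0 is already rounded, so the product is rounded twice.
    if (x[i] / 3.0 != x[i] * (1.0 / 3.0)) atomicAdd(by_three, 1);
}

struct Mismatches {
    int by_two;
    int by_three;
};

// Hypothetical host helper: tests the inputs 1.0, 2.0, ..., n.
Mismatches count_mismatches(int n) {
    std::vector<double> h_x(n);
    for (int i = 0; i < n; ++i) h_x[i] = i + 1.0;

    double *d_x = nullptr;
    int *d_counts = nullptr;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_counts, 2 * sizeof(int));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, 2 * sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    compare<<<blocks, threads>>>(d_x, n, d_counts, d_counts + 1);

    int h_counts[2] = {0, 0};
    cudaMemcpy(h_counts, d_counts, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_counts);
    return {h_counts[0], h_counts[1]};
}

int main() {
    Mismatches m = count_mismatches(1000);
    printf("x/2 vs x*0.5     : %d mismatches\n", m.by_two);
    printf("x/3 vs x*(1.0/3) : %d mismatches\n", m.by_three);
    return 0;
}
```

The count for 2 should be zero, while the count for 3 should not be: x = 5.0 is one input where the two forms round differently. That is why the compiler may rewrite / 2. as a multiplication but has to keep a true division for / r1.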