In my code, I was surprised to find that the theoretical number of FLOPs of my algorithm did not match the count reported by nvvp. After a long time sleuthing through the code, I pinpointed one of the lines that nvvp (flops_dp) miscounts.
I am curious to know what is going on behind the scenes for this to happen…
The return statement is counted as flops_dp = 21 double-precision floating-point operations! :-O
I would only expect either 8 or 7 operations: 8 if we count (-, *, -, *, +, *, /, /), or 7 if the first subtraction (get_const() - 1.) is optimized out by the compiler.
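For concreteness, here is a hypothetical kernel with the same visible operation mix; the variable names and the body of get_const() are invented, since the actual return statement isn't reproduced in this post:

```
// Hypothetical stand-in for the constant accessor mentioned above.
__device__ double get_const() { return 3.0; }

// Invented expression with the operation mix described above:
// 2 subtractions, 3 multiplications, 1 addition, 2 divisions = 8 visible DP ops.
__global__ void example_kernel(const double* in, double* out)
{
    double a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = ((get_const() - 1.0) * a - b * c + d * a) / b / 2.0;
}
```

Profiling a kernel like this with nvprof --metrics flop_count_dp (which, as far as I understand, is the counter behind nvvp's flops_dp figure) should reproduce the inflated count, since each division expands into several instructions.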
If I remove the final division, making the return statement …
The GPU hardware has no support for division operations, either integer or floating-point. These operations are implemented by software sequences that are either inline code or called subroutines. In the case of double-precision floating-point division, this is a called subroutine comprising quite a few floating-point instructions, required to produce a correctly rounded quotient. You can see the details by looking at the generated machine code (SASS) with cuobjdump --dump-sass.
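As a minimal illustration, here is a single-division kernel plus the commands to dump its SASS (a sketch; the exact instruction sequence varies with GPU architecture and CUDA version):

```
// div_sass.cu -- one double-precision division, for inspecting the
// software sequence the compiler emits for '/'.
__global__ void div_kernel(const double* x, const double* y, double* q)
{
    q[0] = x[0] / y[0];
}

// Compile to a cubin and dump the machine code (adjust -arch to your GPU):
//   nvcc -arch=sm_70 -cubin div_sass.cu -o div_sass.cubin
//   cuobjdump --dump-sass div_sass.cubin
// The division typically appears as a called subroutine built from a
// reciprocal approximation (MUFU.RCP64H) refined by several DFMA/DMUL/DADD
// instructions, rather than a single divide instruction.
```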
As for the division by 2, I think you will find that it gets optimized into a multiplication by 0.5. This is a safe transformation even under strict IEEE-754 semantics (see also section F.8.2 of the C99 standard, which specifically calls this out as a safe and permissible transformation).
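A quick way to convince yourself the transformation is value-preserving: because 0.5 is a power of two, x / 2.0 and x * 0.5 compute the same exact mathematical value and round it identically for every double. A small host-side check (compiles with nvcc or any C++ compiler):

```
#include <cstdio>
#include <cstdint>
#include <cstring>

// Compare x / 2.0 and x * 0.5 bit-for-bit on a few sample values.
// Both round the same exact mathematical value x/2, so the results are
// identical, including subnormal results.
int main()
{
    const double samples[] = {1.0, 3.141592653589793, 1e-308, 1e308, -7.25, 0.1};
    for (double x : samples) {
        double d = x / 2.0;
        double m = x * 0.5;
        std::uint64_t bd, bm;
        std::memcpy(&bd, &d, sizeof d);
        std::memcpy(&bm, &m, sizeof m);
        std::printf("x = %-22.17g  %s\n", x, bd == bm ? "identical" : "DIFFER");
    }
    return 0;
}
```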
When running a tiny kernel with just a few divisions, my Nsight report shows about 7 FFMA operations per expected division. Each FFMA can be counted as 2 FLOPs, so one division in the code would result in about 14 FLOPs.

When I analyze my larger kernel, though, I match my theoretical FLOP count with the Nsight report if I count 15 FLOPs per division. This is all very empirical, and it would be nice if the profiling tools reported what constitutes a division, especially since it is not always obvious: there can be faster and slower divisions depending on the accuracy required.
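One way to see the breakdown yourself is to profile an isolated-division kernel and read the per-instruction-type counters that Nsight Compute exposes (a sketch; the metric names below are the ones the nvprof-to-Nsight-Compute mapping uses for flop_count_dp, so verify them against your tool version):

```
// one_div.cu -- one DP division per thread, to isolate its FLOP cost.
__global__ void one_div(const double* x, const double* y, double* q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = x[i] / y[i];
}

// Nsight Compute command-line sketch (./your_app is a placeholder):
//   ncu --metrics smsp__sass_thread_inst_executed_op_dadd_pred_on.sum,\
//       smsp__sass_thread_inst_executed_op_dmul_pred_on.sum,\
//       smsp__sass_thread_inst_executed_op_dfma_pred_on.sum ./your_app
//
// Total DP FLOPs = dadd + dmul + 2 * dfma; dividing each counter by the
// number of executed divisions shows what one division is built from.
```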