actual FLOP count versus flops_dp metric

Hello,

In my code, I was surprised to find that the theoretical number of FLOPs of my algorithm did not coincide with the count reported by nvvp. After a long time sleuthing through the code, I pinpointed one of the lines that is being miscounted by nvvp (flops_dp).

I am curious to know what is going on behind the scenes for this to happen…

These are the relevant lines of code:

#define CON 3.1415967

__device__ double get_const() {
    return CON;
}

__device__ double func(double *Z) {
    double r1, r2, r3, r4;
    r1 = Z[0];
    r2 = Z[1];
    r3 = Z[2];
    r4 = Z[3];

    return (get_const() - 1.) * (r4 - (r2*r2 + r3*r3) / 2. / r1);
}

The return statement is counted as flops_dp = 21 double-precision floating-point operations! :-O

I would only expect either 8 or 7 operations. The former if we count (-, *, -, *, +, *, /, /) = 8 operations. The latter if the first subtraction (get_const() - 1.) is being optimized out by the compiler.

If I remove the final division, making the return statement

return (get_const() - 1.) * (r4 - (r2*r2 + r3*r3) / 2.)

I only get 6 operations, which is reasonable and what I’d expect if the first subtraction is optimized out! Does anyone have an idea what is going on?

The GPU hardware has no support for division operations, neither integer nor floating-point. These operations are implemented by software sequences that are either inline code or called subroutines. In the case of double-precision floating-point division this is a called subroutine comprising quite a few floating-point instructions, required to produce a correctly rounded quotient. You can see the details by looking at the generated machine code (SASS) with cuobjdump --dump-sass.
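As a concrete illustration (my own sketch, not from the thread; the file and kernel names are made up, and the `sm_70` architecture flag is just an example), a minimal kernel like this can be compiled and its machine code inspected to see the division expand into a software sequence:

```cuda
// div_sass.cu -- hypothetical minimal repro
__global__ void div_kernel(const double *a, const double *b, double *q) {
    // This single source-level division becomes a sequence of
    // floating-point instructions (a called subroutine) in the SASS.
    q[0] = a[0] / b[0];
}

// Build and inspect the generated machine code:
//   nvcc -arch=sm_70 -cubin div_sass.cu -o div_sass.cubin
//   cuobjdump --dump-sass div_sass.cubin
// Look for a run of DFMA/DMUL instructions rather than any single
// divide instruction.
```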

Always clear responses njuffa, thanks so much.

As for the division by 2, I think you will find that it gets optimized into a multiplication by 0.5. This is a safe transformation even under strict IEEE-754 semantics (see also section F.8.2 of the C99 standard, which specifically calls this out as a safe and permissible transformation).

When running a tiny kernel with just some divisions, my Nsight report shows about 7 FFMA operations per expected division. Each FMA can be counted as 2 FLOPs, so one division in the code would result in 14 FLOPs.
When I analyze my larger kernel, though, I match my theoretical FLOP count with the Nsight report if I count 15 FLOPs per division. This is all very empirical, and it would be nice if the profiling tools reported what constitutes a division. Also because it is not always obvious: sometimes there can be faster and slower divisions depending on the accuracy required.