Reduction operation on every thread and Execution Times


I have a simplified version of a big kernel as follows, and I’m trying to time it:

    __global__ void fluxSum( double* array, double* result, int N )
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;

        if ( threadID < N )
        {
            double arrayValue = array[threadID];
            double sumx = 0.0;
            for ( int i = 0; i < 100; i++ )
                sumx = sumx + arrayValue * arrayValue;

            // The result of the loop above is 2500.00 in every thread
            // However, let's have another variable 'sumy'
            double sumy = 2500.00;

            // Plug in the result
            result[threadID] = sumx; //  Takes 3.5 millisec  -------  ( Assignment 1 )
            result[threadID] = sumy; //  Takes 0.15 millisec -------- ( Assignment 2 )
        }
    }


When I time the above kernel using Assignment 1, it takes 3.5 milliseconds (I call cudaThreadSynchronize() in the main program before reading the timer).
But when I use Assignment 2 (I still compute sumx, I just do not use it), it takes only 0.15 milliseconds…
Could someone explain this behavior to me? Am I missing something here…

I agree that if I did not compute sumx at all and used Assignment 2 directly, I should get a smaller timing; but I still get the smaller timing when I compute sumx and then use sumy…

These timings were taken on a Tesla C1060, with double precision enabled.
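As a side note on the timing method: cudaThreadSynchronize() plus a host timer works, but CUDA events measure the kernel on the device directly. A minimal sketch, assuming device pointers d_array/d_result and a 256-threads-per-block launch (both are assumptions, not from the post):

```cuda
// Hypothetical timing harness for the fluxSum kernel; d_array, d_result,
// N and the launch configuration are illustrative assumptions.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
fluxSum<<<(N + 255) / 256, 256>>>(d_array, d_result, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```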


If you do not use sumx, the compiler optimizes away its whole computation as well.


But I will need to use sumx in the actual kernel. Why does it take less time when I use sumy although sumx is computed ?

It probably isn’t computed, because the compiler understands it doesn’t need to be computed.

The only question is: is the output of the kernel (the write to global memory) dependent on sumx? If the answer is no, as in the code shown above, ptxas will automatically eliminate all of the computation for sumx. You could, though, supply a command-line option asking ptxas not to optimize at all (e.g. compiling with -Xptxas -O0)…
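To see the effect the other way around: if you want the sumx loop to survive in the compiled code while still storing sumy, make the store depend on sumx in a way the optimizer is unlikely to prove away. A sketch of the kernel body (the comparison is just an artificial data dependence, not part of the original code):

```cuda
// Inside the if ( threadID < N ) block of the kernel above.
// For this data the branch is never taken, but the compiler is
// unlikely to prove that, so the sumx loop must be kept.
if ( sumx < 0.0 )
    result[threadID] = sumx;
else
    result[threadID] = sumy;
```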

Very true… the kernel really does not compute sumx if it is not used. I tried adding additional floating-point operations to sumx, but the time remained the same.