I have a simplified version of a big kernel as follows, and I'm trying to time it:
__global__ void fluxSum( double* array, double* result, int N )
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if ( threadID < N )
    {
        double arrayValue = array[threadID];
        double sumx = 0.0;
        for ( int i = 0; i < 100; i++ )
            sumx = sumx + arrayValue * arrayValue;

        // With my input data, the loop above yields 2500.00 in every thread.
        // Now introduce another variable 'sumy' holding that same value:
        double sumy = 2500.00;

        // Plug in the result -- I use one of these two lines at a time:
        result[threadID] = sumx;   // Takes 3.5 millisec  ------ ( Assignment 1 )
        result[threadID] = sumy;   // Takes 0.15 millisec ------ ( Assignment 2 )
    }
}
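For completeness, here is roughly how I launch and time the kernel on the host side (a sketch of my real code -- the array size, block size, fill value, and event-based timing are simplifications):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;
    double *d_array, *d_result;
    cudaMalloc( &d_array,  N * sizeof(double) );
    cudaMalloc( &d_result, N * sizeof(double) );
    // (In the real code d_array is filled with data; with 5.0 in each
    //  slot, 100 iterations of adding 25.0 give 2500.0 per thread.)

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    int block = 256;
    int grid  = ( N + block - 1 ) / block;

    cudaEventRecord( start );
    fluxSum<<< grid, block >>>( d_array, d_result, N );
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );   // blocks like cudaThreadSynchronize()

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "kernel time: %f ms\n", ms );

    cudaFree( d_array );
    cudaFree( d_result );
    return 0;
}
```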
When I time the above kernel using Assignment 1, it takes 3.5 milliseconds (I call cudaThreadSynchronize() in the main program before stopping the timer).
But when I use Assignment 2 instead (still computing sumx, just never storing it), it takes only 0.15 milliseconds.
Could someone explain this behavior to me? Am I missing something here?
I agree that if I skipped computing sumx altogether and went straight to Assignment 2, I should see a smaller timing; but I still get the smaller timing even though sumx is computed and only sumy is stored.
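One thing I suspect, but cannot confirm, is that with Assignment 2 the compiler sees that sumx is never read and removes the whole loop as dead code. If that is what is happening, forcing the stored value to depend on sumx should bring the timing back up. A hypothetical variant to test this (the kernel name and the 0.0 * sumx trick are mine, not from my real code):

```cuda
__global__ void fluxSumCheck( double* array, double* result, int N )
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if ( threadID < N )
    {
        double arrayValue = array[threadID];
        double sumx = 0.0;
        for ( int i = 0; i < 100; i++ )
            sumx = sumx + arrayValue * arrayValue;

        double sumy = 2500.00;
        // 0.0 * sumx is 0.0 for any finite sumx, so the stored value
        // is unchanged, but the compiler can no longer prove the loop
        // result is unused and so cannot delete the loop.
        result[threadID] = sumy + 0.0 * sumx;
    }
}
```

If this variant times close to 3.5 ms, that would point at dead-code elimination rather than anything about the store itself.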
This was on a Tesla C1060, with double precision support enabled.