Hi,

I have a simplified version of a large kernel, as follows, and I'm trying to time it:

```
__global__ void fluxSum( double* array, double* result, int N )
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    if ( threadID < N )
    {
        double arrayValue = array[threadID];
        double sumx = 0.0;

        for ( int i = 0 ; i < 100 ; i++ )
            sumx = sumx + arrayValue*arrayValue;

        // The above loop yields 2500.00 in every thread
        // However, let's have another variable 'sumy'
        double sumy = 2500.00;
        // Plug in the result
        result[threadID] = sumx ; // Takes 3.5 millisec ------- ( Assignment 1 )
        result[threadID] = sumy ; // Takes 0.15 millisec ------ ( Assignment 2 )
    }
}
```

When I time the above kernel and use Assignment 1, it takes 3.5 milliseconds (I call cudaThreadSynchronize() in the main program before stopping the timer).

But when I use Assignment 2 instead (I still compute sumx, I just don't store it), it takes only 0.15 milliseconds.
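For reference, my timing code in the main program looks roughly like this (a minimal sketch using CUDA events; the grid/block sizes, `N`, and the device pointers `d_array`/`d_result` are illustrative placeholders, not my actual values):

```
// Sketch of the timing harness around the kernel launch
// (error checking omitted for brevity).
int N = 1 << 20;
dim3 block( 256 );
dim3 grid( (N + block.x - 1) / block.x );

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
fluxSum<<< grid, block >>>( d_array, d_result, N );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );   // plays the role of cudaThreadSynchronize() here

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );
printf( "kernel time: %f ms\n", ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );
```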

Could someone explain this behavior to me? Am I missing something here?

I agree that if I did not compute sumx at all and used Assignment 2 directly, I should get a smaller timing, but I still get the smaller timing even when I compute sumx and then store sumy.

These timings were on a Tesla C1060, with double precision enabled.

Thanks,

Dominic