My kernel looks like this:
[codebox]
for (row = thread_id; row < limit; row += grid_size)
{
    int sum = 0;
    // … code omitted … //
    for (int n = start; n < end; n += height)
    {
        const int col = Aj[n];
        if (col != -1)
            sum ^= Ap[col];   // XOR-accumulate the entry selected by this column index
    }
    // __syncthreads();
    y[row] = sum;             // one global-memory write per row
}
[/codebox]
Aj, Ap and y are all in global memory. If I uncomment the __syncthreads() call, the kernel runs faster. I can't explain this. My guess is that forcing all threads to write to global memory at the same time is what helps?
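To make the comparison reproducible, here is a minimal sketch of how both variants could be timed with CUDA events. It is not my real kernel: xor_rows, time_ms, the array sizes and the launch configuration are made-up names and values, and the omitted start/end/height logic is reduced to a single lookup per row. I picked limit as a multiple of the grid size so that every thread in a block reaches __syncthreads() the same number of times, which as far as I know is required for the barrier to be well defined.

[codebox]
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel above: one Aj entry per row.
template <bool UseSync>
__global__ void xor_rows(const int *Aj, const int *Ap, int *y, int limit)
{
    const int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int grid_size = gridDim.x * blockDim.x;
    for (int row = thread_id; row < limit; row += grid_size)
    {
        int sum = 0;
        const int col = Aj[row];
        if (col != -1)
            sum ^= Ap[col];
        if (UseSync)
            __syncthreads();   // uniform here only because limit % grid_size == 0
        y[row] = sum;
    }
}

// Average time per launch in milliseconds, measured with CUDA events.
template <bool UseSync>
float time_ms(const int *Aj, const int *Ap, int *y, int limit,
              int blocks, int threads, int reps)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    xor_rows<UseSync><<<blocks, threads>>>(Aj, Ap, y, limit);   // warm-up launch
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        xor_rows<UseSync><<<blocks, threads>>>(Aj, Ap, y, limit);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms / reps;
}

int main()
{
    const int blocks = 64, threads = 256;           // assumed launch configuration
    const int limit = blocks * threads * 64;        // multiple of the grid size
    std::vector<int> hAj(limit), hAp(limit);
    for (int i = 0; i < limit; ++i)
    {
        hAj[i] = (i % 16 == 0) ? -1 : (i * 7) % limit;   // some entries marked empty
        hAp[i] = i;
    }

    int *Aj, *Ap, *y;
    const size_t bytes = limit * sizeof(int);
    cudaMalloc(&Aj, bytes);
    cudaMalloc(&Ap, bytes);
    cudaMalloc(&y, bytes);
    cudaMemcpy(Aj, hAj.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Ap, hAp.data(), bytes, cudaMemcpyHostToDevice);

    const float without = time_ms<false>(Aj, Ap, y, limit, blocks, threads, 100);
    const float with    = time_ms<true>(Aj, Ap, y, limit, blocks, threads, 100);
    printf("without __syncthreads(): %.4f ms per launch\n", without);
    printf("with    __syncthreads(): %.4f ms per launch\n", with);

    cudaFree(Aj);
    cudaFree(Ap);
    cudaFree(y);
    return 0;
}
[/codebox]

This should build with nvcc as a single .cu file. Whether the barrier helps probably depends on the access pattern in Aj and on the GPU, so the numbers from this sketch may not match what I see with the real data.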