Are there cases where __syncthreads() improves performance?

My kernel looks like this:

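[codebox]
for (row = thread_id; row < limit; row += grid_size)
{
    int sum = 0;

    // … code omit… //

    // XOR together the Ap entries selected by this row's column indices
    for (int n = start; n < end; n += height)
    {
        const int col = Aj[n];
        if (col != -1)          // -1 marks an unused slot
            sum ^= Ap[col];
    }

    // __syncthreads();        // uncommenting this makes the kernel faster

    y[row] = sum;
}
[/codebox]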

Aj, Ap, and y are in global memory. If I add __syncthreads(), the kernel gets faster. I can’t explain this; my guess is that forcing all threads to write to global memory at the same time is what does it?

And that may enable memory coalescing → less load on the memory controller, fewer simultaneous transactions, and faster completion. But that’s just my theory.
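
One way to check this theory instead of guessing is to time both variants with CUDA events. What follows is only a sketch under stated assumptions, not the kernel from the post: the omitted code is replaced by an assumed ELL-style layout (`start = row`, `end = row + K * height`), and `xor_rows`, `time_kernel`, `K`, `ncols`, the launch configuration, and the synthetic data are all hypothetical. The loop is also restructured around a block-uniform `base`, because `__syncthreads()` is only well defined when every thread in the block reaches it, and in a plain grid-stride loop threads of the same block can run different numbers of iterations.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the post. The omitted code is
// replaced by an ASSUMED layout: start = row, end = row + K * height.
template <bool USE_BARRIER>
__global__ void xor_rows(const int *Aj, const int *Ap, int *y,
                         int limit, int height, int K)
{
    const int grid_size = gridDim.x * blockDim.x;
    // Iterate on a block-uniform base so all threads of a block execute the
    // same number of iterations and therefore all reach the barrier.
    for (int base = blockIdx.x * blockDim.x; base < limit; base += grid_size) {
        const int row = base + threadIdx.x;
        int sum = 0;
        if (row < limit) {
            for (int n = row; n < row + K * height; n += height) {
                const int col = Aj[n];
                if (col != -1)
                    sum ^= Ap[col];
            }
        }
        if (USE_BARRIER)
            __syncthreads();   // line the block up before the store
        if (row < limit)
            y[row] = sum;
    }
}

// Average time per launch in milliseconds, measured with CUDA events.
template <bool USE_BARRIER>
float time_kernel(const int *Aj, const int *Ap, int *y,
                  int limit, int height, int K)
{
    const int reps = 20;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    xor_rows<USE_BARRIER><<<64, 256>>>(Aj, Ap, y, limit, height, K); // warm-up
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        xor_rows<USE_BARRIER><<<64, 256>>>(Aj, Ap, y, limit, height, K);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms / reps;
}

int main()
{
    // Synthetic problem sizes, chosen only for illustration.
    const int limit = 1 << 20, height = limit, K = 16, ncols = 1 << 16;
    std::vector<int> hAj(size_t(height) * K), hAp(ncols);
    for (size_t i = 0; i < hAj.size(); ++i)   // scattered columns, some -1
        hAj[i] = (i % 7 == 0) ? -1 : int((i * 2654435761u) % ncols);
    for (int i = 0; i < ncols; ++i)
        hAp[i] = i;

    int *Aj, *Ap, *y;
    cudaMalloc(&Aj, hAj.size() * sizeof(int));
    cudaMalloc(&Ap, hAp.size() * sizeof(int));
    cudaMalloc(&y, limit * sizeof(int));
    cudaMemcpy(Aj, hAj.data(), hAj.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(Ap, hAp.data(), hAp.size() * sizeof(int), cudaMemcpyHostToDevice);

    printf("without barrier: %.3f ms\n",
           time_kernel<false>(Aj, Ap, y, limit, height, K));
    printf("with barrier:    %.3f ms\n",
           time_kernel<true>(Aj, Ap, y, limit, height, K));

    cudaFree(Aj);
    cudaFree(Ap);
    cudaFree(y);
    return 0;
}
```

Whether the barrier helps will likely depend on the GPU generation and on how unevenly the inner loop finishes across warps, so the numbers are only meaningful for your own data and hardware. Running both variants under a profiler and comparing the global store transaction counts would show more directly whether coalescing is what changed.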
