++threadReg is as slow as ++d_globalMem[tx] (global memory access time)

We are trying to keep a count of the results to be output. Simplified pseudocode looks like this:

for(…)
{
    if(…)
A:      ++cnt; // thread register
}
__syncthreads();
B: d_cnts[bx * blockDim.x + tx] = cnt; // global memory

We expected this to be much faster than:

for(…)
{
    if(…)
        ++d_cnts[bx * blockDim.x + tx]; // global memory
}

However, timing shows that the two versions take roughly the same amount of time, and both are very slow.
If we comment out either line A or line B, however, the routine becomes blazing fast (a difference of more than 5 to 6 times).

My rationalization of this phenomenon:
cnt is not allocated in a register, so ++cnt is extremely slow.
When line B is commented out, cnt is never read back, so the smart compiler omits the slow ++cnt operation.

This theory seems to explain the weirdness, but is there a solution for it? Or does anyone have a different explanation?

Why not look at the .ptx file generated by nvcc to verify your theory?

Mark

This is expected. Without line A or without line B, the for loop has no observable effect, so the compiler will optimize it away completely.

Peter
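
For anyone who wants to check both replies on their own machine, here is a minimal sketch rather than the original poster's code. The predicate matches(), the kernel names, the helper countOnGpu() and the itemsPerThread parameter are all hypothetical stand-ins for the parts elided in the pseudocode. The first kernel keeps lines A and B; the second drops line B, so nothing it computes is ever observable and the compiler is free to remove the loop. Compiling with "nvcc -ptx file.cu -o file.ptx" and comparing the two kernel bodies should show whether that happens, and whether cnt lives in a register or is spilled (spills show up as ld.local / st.local instructions).

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical predicate standing in for the elided if(...) condition.
__device__ bool matches(int value) { return (value & 1) != 0; }

// Lines A and B both present: count in a local variable, store once.
__global__ void countWithStore(const int *d_in, int n, int itemsPerThread,
                               int *d_cnts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int cnt = 0;                              // normally kept in a register
    for (int i = 0; i < itemsPerThread; ++i) {
        int idx = tid * itemsPerThread + i;
        if (idx < n && matches(d_in[idx]))
            ++cnt;                            // line A
    }
    d_cnts[tid] = cnt;                        // line B
}

// Line B removed: cnt is never stored, so the loop has no visible effect.
// The compiler may delete the loop entirely; check the PTX to confirm.
__global__ void countWithoutStore(const int *d_in, int n, int itemsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int cnt = 0;
    for (int i = 0; i < itemsPerThread; ++i) {
        int idx = tid * itemsPerThread + i;
        if (idx < n && matches(d_in[idx]))
            ++cnt;                            // line A only
    }
}

// Host helper (hypothetical): runs countWithStore and returns one count per
// launched thread. Error checking is omitted to keep the sketch short.
std::vector<int> countOnGpu(const std::vector<int> &h_in, int threads,
                            int itemsPerThread)
{
    assert(!h_in.empty() && threads > 0 && itemsPerThread > 0);
    int n = static_cast<int>(h_in.size());
    int perBlock = threads * itemsPerThread;
    int blocks = (n + perBlock - 1) / perBlock;
    int totalThreads = blocks * threads;

    int *d_in = nullptr;
    int *d_cnts = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_cnts, totalThreads * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    countWithStore<<<blocks, threads>>>(d_in, n, itemsPerThread, d_cnts);

    std::vector<int> h_cnts(totalThreads);
    cudaMemcpy(h_cnts.data(), d_cnts, totalThreads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_cnts);
    return h_cnts;
}
```

If the PTX for countWithoutStore turns out to be little more than a return, then the 5 to 6 times speedup measures the cost of the whole loop (including evaluating the condition and reading its inputs), not the cost of ++cnt. In that case timing the version with both lines against a version that stores some other cheap value would be a fairer comparison.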