We are trying to keep a per-thread count of the results to be output. Simplified pseudocode looks like this:
    for (…)
    {
        if (…)
            ++cnt;                           // line A: thread register
    }
    __syncthreads();
    d_cnts[bx * blockDim.x + tx] = cnt;      // line B: global memory
We expected this to be much faster than the following:
    for (…)
    {
        if (…)
            ++d_cnts[bx * blockDim.x + tx];
    }
However, timing shows that the two versions take more or less the same amount of time, and both are very slow.
If we comment out either line A or line B, however, the routine becomes blazing fast (5 to 6 times faster, or more).
My rationalization of this phenomenon:
cnt is not allocated in a register, so ++cnt is extremely slow.
When line B is commented out, cnt is never read back, so the compiler is smart enough to omit the slow ++cnt operation entirely.
This theory seems to explain the odd behavior, but is there a solution for it? Or does anyone have a different explanation?
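For anyone who wants to reproduce the comparison, here is a minimal sketch of the two variants. The real loop bounds and condition are elided above, so n_iter, is_match, and run_variant are hypothetical stand-ins introduced only for this example, and the timings it produces will not necessarily match those of our real kernel. Compiling with nvcc -Xptxas -v prints the register and local memory usage of each kernel, which should be one way to check whether cnt really ends up outside a register.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided condition in the post.
__device__ bool is_match(int idx, int i) { return (idx + i) % 3 == 0; }

// Variant 1: count in a local variable, write to global memory once.
__global__ void count_local(int *d_cnts, int n_iter)
{
    int bx = blockIdx.x, tx = threadIdx.x;
    int cnt = 0;
    for (int i = 0; i < n_iter; ++i)
        if (is_match(bx * blockDim.x + tx, i))
            ++cnt;                               // line A
    __syncthreads();
    d_cnts[bx * blockDim.x + tx] = cnt;          // line B
}

// Variant 2: increment the global-memory counter directly.
// d_cnts must be zeroed before the launch.
__global__ void count_global(int *d_cnts, int n_iter)
{
    int bx = blockIdx.x, tx = threadIdx.x;
    for (int i = 0; i < n_iter; ++i)
        if (is_match(bx * blockDim.x + tx, i))
            ++d_cnts[bx * blockDim.x + tx];
}

// Hypothetical host helper: runs one variant, returns the per-thread counts,
// and stores the kernel time in milliseconds in *ms.
std::vector<int> run_variant(bool use_local, int blocks, int threads,
                             int n_iter, float *ms)
{
    assert(blocks > 0 && threads > 0);
    int n = blocks * threads;
    int *d_cnts = nullptr;
    cudaMalloc((void **)&d_cnts, n * sizeof(int));
    cudaMemset(d_cnts, 0, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (use_local)
        count_local<<<blocks, threads>>>(d_cnts, n_iter);
    else
        count_global<<<blocks, threads>>>(d_cnts, n_iter);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish
    cudaEventElapsedTime(ms, start, stop);

    std::vector<int> h_cnts(n);
    cudaMemcpy(h_cnts.data(), d_cnts, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_cnts);
    return h_cnts;
}
```

Calling run_variant(true, …) and run_variant(false, …) with the same arguments should return identical counts, so any difference in the reported times comes from where the counter lives and how the compiler treats it. The first launch in a process can include one-time startup cost, so it is probably worth running each variant more than once before comparing.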