I’ve hit a problem that has completely baffled me, and I think I might be misunderstanding the programming guide. My understanding is that for a thread block of size n x m x 1, the threads are given x-indices from 0 to n-1 and y-indices from 0 to m-1. I’m working on a pretty simple search-type kernel (obstacle detection) where each thread block writes to a single global variable if it finds a collision. I’d like every thread in a block to stop as soon as any thread in the block finds a collision. We’ll be getting a compute-capability 2.0 card soon, at which point I’ll just use __syncthreads_or() to signal the break, but for now I’m trying to make this work on a 1.2 card. To signal the termination, each thread that finds a collision writes to a shared variable, then all threads synchronize, and then each thread breaks conditioned on that shared variable (below).
Now, as a first step to speeding things up, I figured that only one thread in the block should write to global memory, so I use an “if(threadIdx.x == 0 && threadIdx.y == 0)” conditional. However, with that condition in place, no value is written to global memory. If I remove the condition, the value is written. I can’t figure out how this is possible. Surely there is always exactly one thread in each block with threadIdx.x == 0 and threadIdx.y == 0, right? Can anyone tell me what I am missing here?
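As a sanity check on the indexing assumption alone, here is a minimal sketch that counts, per block, how many threads satisfy threadIdx.x == 0 && threadIdx.y == 0. The names countOriginThreads, counts_d and everyBlockHasOneOriginThread are hypothetical and exist only for this check; they are not part of the real kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical check kernel: every thread with threadIdx.x == 0 and
// threadIdx.y == 0 increments its block's counter.
__global__ void countOriginThreads(int* counts_d)
{
    const unsigned int blockId = gridDim.x * blockIdx.y + blockIdx.x;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(&counts_d[blockId], 1);  // global int atomics need CC >= 1.1
}

// Hypothetical host helper: returns true only if every block counted
// exactly one such thread (and no CUDA call failed).
bool everyBlockHasOneOriginThread(dim3 grid, dim3 block)
{
    const size_t numBlocks = (size_t)grid.x * grid.y;
    const size_t bytes = numBlocks * sizeof(int);
    int* counts_d = NULL;
    if (cudaMalloc((void**)&counts_d, bytes) != cudaSuccess)
        return false;
    cudaMemset(counts_d, 0, bytes);

    countOriginThreads<<<grid, block>>>(counts_d);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<int> counts(numBlocks, -1);
    const cudaError_t err =
        cudaMemcpy(&counts[0], counts_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(counts_d);
    if (err != cudaSuccess)
        return false;

    for (size_t i = 0; i < numBlocks; ++i)
        if (counts[i] != 1)
            return false;
    return true;
}
```

Calling, for example, everyBlockHasOneOriginThread(dim3(4, 3), dim3(16, 8)) should return true if the indexing works as described above. In that case the missing write would have to come from something else in the kernel, such as thread (0, 0) no longer being in the loop when the flag is set.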
// shared variable for signaling an abort
__shared__ int sCollision;
// thread-wise storage for collision result
int collision = 0;
// the output offset is given by the batchSize*batchId + pairId
const unsigned int outOffset = gridDim.y * blockIdx.x + blockIdx.y;
…
///\todo warp-vote and warp-sync the collision writes
if(collision)
    sCollision=1;
// synchronize so that all writes to the flag complete before we try to
// read it
__syncthreads();
if(sCollision)
{
    ///\todo make it so that only one thread writes to the output
    //if(threadIdx.x == 0 && threadIdx.y == 0)
    output_d[outOffset] = 1;
    break;
}
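For comparison, here is a minimal, self-contained sketch of the block-wide early-exit pattern described above. It is not a diagnosis of the elided code, and it rests on several assumptions:

- The real collision test is replaced by a hypothetical lookup into flags_d.
- The flag is zeroed before the loop.
- The loop bound is the same for every thread in the block, so all threads reach every barrier.
- A second __syncthreads() keeps a fast thread from setting the flag for the next iteration while a slower thread is still reading it.

detectCollisions, flags_d, itemsPerBlock and detectOnHost are illustrative names only.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in kernel: item i of a block "collides" iff its flag is nonzero.
__global__ void detectCollisions(const int* flags_d, unsigned int itemsPerBlock,
                                 int* output_d)
{
    __shared__ int sCollision;  // only used by the pre-2.0 path below
    const unsigned int tid = blockDim.x * threadIdx.y + threadIdx.x;
    const unsigned int nThreads = blockDim.x * blockDim.y;
    const unsigned int outOffset = gridDim.y * blockIdx.x + blockIdx.y;
    const int* blockFlags = flags_d + outOffset * itemsPerBlock;

    // Initialise the flag once and make the write visible to the whole block.
    if (tid == 0)
        sCollision = 0;
    __syncthreads();

    // The bound depends only on block-uniform values, so no thread leaves
    // the loop before the others.
    for (unsigned int base = 0; base < itemsPerBlock; base += nThreads) {
        const unsigned int i = base + tid;
        const int collision = (i < itemsPerBlock) && (blockFlags[i] != 0);
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
        // Compute capability 2.0+: barrier and OR-reduction in one call.
        const int anyCollision = __syncthreads_or(collision);
#else
        if (collision)
            sCollision = 1;
        __syncthreads();                     // all writes to the flag are done
        const int anyCollision = sCollision;
        __syncthreads();                     // all reads done before the next write
#endif
        if (anyCollision) {                  // same value in every thread
            if (tid == 0)
                output_d[outOffset] = 1;     // a single thread writes the result
            break;
        }
    }
}

// Hypothetical host helper: one block per group of itemsPerBlock flags.
// Returns one result per block, or an empty vector if a CUDA call fails.
std::vector<int> detectOnHost(const std::vector<int>& flags,
                              unsigned int itemsPerBlock)
{
    const unsigned int numBlocks = (unsigned int)(flags.size() / itemsPerBlock);
    int* flags_d = NULL;
    int* output_d = NULL;
    if (cudaMalloc((void**)&flags_d, flags.size() * sizeof(int)) != cudaSuccess)
        return std::vector<int>();
    if (cudaMalloc((void**)&output_d, numBlocks * sizeof(int)) != cudaSuccess) {
        cudaFree(flags_d);
        return std::vector<int>();
    }
    cudaMemcpy(flags_d, &flags[0], flags.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(output_d, 0, numBlocks * sizeof(int));

    detectCollisions<<<dim3(numBlocks, 1), dim3(16, 8)>>>(flags_d, itemsPerBlock,
                                                          output_d);

    std::vector<int> output(numBlocks, -1);
    const cudaError_t err = cudaMemcpy(&output[0], output_d,
                                       numBlocks * sizeof(int),
                                       cudaMemcpyDeviceToHost);
    cudaFree(flags_d);
    cudaFree(output_d);
    return (err == cudaSuccess) ? output : std::vector<int>();
}
```

Because every thread sees the same anyCollision value, the whole block leaves the loop in the same iteration, and thread (0, 0) is guaranteed to still be running when the result is written.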