no thread with threadIdx.x==0 && threadIdx.y ==0?

I hit a problem that has completely baffled me, and I think I might be misunderstanding the programming guide. It’s my understanding that for a thread block of size n x m x 1, the threads are given x-indices from 0 to n-1 and y-indices from 0 to m-1. I’m working on a pretty simple search-type kernel (obstacle detection) where each thread block writes to a single global variable if it finds a collision. I’d like every thread in a block to return immediately if any thread in the block finds a collision. We’ll be getting a compute-capability 2.0 card soon, in which case I’ll just use __syncthreads_or() to signal the break, but for now I’m trying to make this work on a 1.2 card. To signal the termination, I have each thread that finds a collision write to a shared variable, then synchronize, and then break conditioned on that shared variable (below).

Now, as a first step to speeding things up, I figured that only one thread in the block should write to global memory, so I use an “if(threadIdx.x == 0 && threadIdx.y == 0)” conditional. However, with that condition in place, no value is written to global memory. If I remove the condition, the value is written. I can’t figure out how this is possible. Surely there is always exactly one thread with threadIdx.x == 0 and threadIdx.y == 0 in each block, right? Can anyone tell me what I’m missing here?

// shared variable for signaling an abort
__shared__ int sCollision;

// thread-wise storage for collision result
int collision = 0;

// the output offset is given by the batchSize*batchId + pairId
const unsigned int outOffset = gridDim.y * blockIdx.x + blockIdx.y;

///\todo warp-vote and warp-sync the collision writes

// synchronize so that all writes to the flag complete before we try to
// read it

    ///\todo make it so that only one thread writes to the output
    //if(threadIdx.x == 0 && threadIdx.y == 0)
        output_d[outOffset] = 1;
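
On the question of whether a (0, 0) thread always exists: for an n x m x 1 block, threadIdx.x runs from 0 to n-1 and threadIdx.y from 0 to m-1, so each block should contain exactly one such thread. A minimal sketch for checking this empirically is shown here. The kernel `countOriginThreads` and the host helper `checkThreadIndices` are hypothetical names introduced for illustration, not part of the posted kernel, and the sketch assumes a 1D grid and a device that supports global-memory atomics (compute capability 1.1 or later).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical check kernel (not from the original post). Each block counts
// how many of its threads satisfy threadIdx.x == 0 && threadIdx.y == 0, and
// the whole grid records the largest x and y thread indices it observed.
__global__ void countOriginThreads(int *originCount, int *maxIdx)
{
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(&originCount[blockIdx.x], 1);

    atomicMax(&maxIdx[0], (int)threadIdx.x);
    atomicMax(&maxIdx[1], (int)threadIdx.y);
}

// Launches numBlocks blocks of n x m x 1 threads. Returns true when every
// block had exactly one (0, 0) thread and the indices topped out at n-1, m-1.
bool checkThreadIndices(int n, int m, int numBlocks)
{
    int *originCount_d = 0;
    int *maxIdx_d = 0;
    if (cudaMalloc(&originCount_d, numBlocks * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMalloc(&maxIdx_d, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(originCount_d);
        return false;
    }
    cudaMemset(originCount_d, 0, numBlocks * sizeof(int));
    cudaMemset(maxIdx_d, 0, 2 * sizeof(int));

    countOriginThreads<<<numBlocks, dim3(n, m, 1)>>>(originCount_d, maxIdx_d);

    // cudaMemcpy waits for the kernel and reports any launch failure.
    std::vector<int> originCount(numBlocks, -1);
    int maxIdx[2] = {-1, -1};
    bool ok =
        cudaMemcpy(&originCount[0], originCount_d, numBlocks * sizeof(int),
                   cudaMemcpyDeviceToHost) == cudaSuccess &&
        cudaMemcpy(maxIdx, maxIdx_d, 2 * sizeof(int),
                   cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(originCount_d);
    cudaFree(maxIdx_d);

    for (int b = 0; ok && b < numBlocks; ++b)
        ok = (originCount[b] == 1);
    return ok && maxIdx[0] == n - 1 && maxIdx[1] == m - 1;
}
```

Calling something like `checkThreadIndices(16, 8, 4)` from host code exercises the check. If it passes, the missing write is more likely caused by control flow or synchronization elsewhere in the kernel than by the thread indexing itself.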



I can’t tell because you are not posting complete code. My first guess, however, is that you might be missing a __syncthreads() between the initialization of sCollision and the code block you posted.
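
To illustrate the ordering this guess refers to, here is a minimal sketch of the shared-flag pattern with a barrier on both sides of the flag writes. Without the first barrier, a thread in another warp could set the flag before thread (0, 0) clears it, and the later clear would wipe out the signal. Everything here is an assumption made for illustration: the kernel name `flagCollisions`, the precomputed `hits_d` array standing in for the real collision test, the 1D grid, and the host helper `runFlagCollisions` are not from the posted kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: hits_d holds one precomputed
// 0/1 collision result per thread, and each block writes to output_d[blockIdx.x].
__global__ void flagCollisions(const int *hits_d, int *output_d)
{
    // shared variable for signaling that some thread found a collision
    __shared__ int sCollision;

    if (threadIdx.x == 0 && threadIdx.y == 0)
        sCollision = 0;
    // barrier 1: the flag is cleared before any thread may set it
    __syncthreads();

    const unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
    const unsigned int threadsPerBlock = blockDim.x * blockDim.y;
    const int collision = hits_d[blockIdx.x * threadsPerBlock + tid];

    if (collision)
        sCollision = 1;
    // barrier 2: all writes to the flag complete before it is read
    __syncthreads();

    // only thread (0, 0) of each block writes the block's result
    if (sCollision && threadIdx.x == 0 && threadIdx.y == 0)
        output_d[blockIdx.x] = 1;
}

// Host helper: runs numBlocks blocks of n x m threads over the given hits
// and returns one 0/1 entry per block (an empty vector on any CUDA error).
std::vector<int> runFlagCollisions(const std::vector<int> &hits,
                                   int n, int m, int numBlocks)
{
    if ((int)hits.size() != n * m * numBlocks)
        return std::vector<int>();

    const size_t hitsBytes = hits.size() * sizeof(int);
    const size_t outBytes = numBlocks * sizeof(int);
    int *hits_d = 0;
    int *output_d = 0;
    if (cudaMalloc(&hits_d, hitsBytes) != cudaSuccess)
        return std::vector<int>();
    if (cudaMalloc(&output_d, outBytes) != cudaSuccess) {
        cudaFree(hits_d);
        return std::vector<int>();
    }
    cudaMemcpy(hits_d, &hits[0], hitsBytes, cudaMemcpyHostToDevice);
    cudaMemset(output_d, 0, outBytes);

    flagCollisions<<<numBlocks, dim3(n, m, 1)>>>(hits_d, output_d);

    std::vector<int> output(numBlocks, 0);
    const bool ok = cudaMemcpy(&output[0], output_d, outBytes,
                               cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(hits_d);
    cudaFree(output_d);
    return ok ? output : std::vector<int>();
}
```

Both barriers sit outside any thread-dependent branch, so every thread in the block reaches them. That matters because __syncthreads() inside a conditional is only safe when the condition evaluates identically across the whole block.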

I can post the complete code if you’d be willing to look at it. I don’t think the initialization of sCollision could affect this part, though, since __syncthreads() is called immediately before sCollision is read. For reference, this is how it’s initialized:

if(threadIdx.x == 0 && threadIdx.y == 0)
    sCollision = 0;


The whole kernel is only about 100 lines. I’d be happy to post the whole thing.

Indeed that looks correct. So I think we can only find the problem by looking at the whole kernel.