I am running a kernel with only as many blocks as there are multiprocessors. Each block runs in a loop to cover all the data elements. I am doing this because each block uses almost all of the shared memory.
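For context, the launch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual kernel: `scaleAll`, `runScale`, the scaling operation, and the block size of 256 are all hypothetical stand-ins.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block walks the data in grid-sized strides,
// so a fixed number of blocks can cover any number of elements.
__global__ void scaleAll(float* data, int n, float factor)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= factor;
    }
}

// Hypothetical host helper: launches one block per multiprocessor
// and returns the scaled copy of the input.
std::vector<float> runScale(const std::vector<float>& in, float factor)
{
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    int numBlocks = prop.multiProcessorCount;  // one block per multiprocessor
    int threads   = 256;                       // assumed block size

    float* d = nullptr;
    size_t bytes = in.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, in.data(), bytes, cudaMemcpyHostToDevice);

    scaleAll<<<numBlocks, threads>>>(d, static_cast<int>(in.size()), factor);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d);
    return out;
}
```

Calling `runScale(std::vector<float>(100000, 2.0f), 3.0f)` would process all 100000 elements with only `multiProcessorCount` blocks.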
I need to synchronize all the blocks, i.e. have a global barrier, since every block needs data that every other block writes to global memory.
I wrote the following code for the global barrier:
    __device__ unsigned int syncCounter = 0;

    __device__ void syncBlocks()
    {
        if (threadIdx.x == 0)
        {
            atomicInc(&syncCounter, gridDim.x - 1);
            __threadfence();
        }
        __syncthreads();
        volatile unsigned int* counter = &syncCounter;
        do {
        } while (*counter > 0);
    }
According to the code, the last block to execute this function is supposed to set syncCounter, which is a global variable, back to 0, because atomicInc() wraps the stored value to 0 once it has reached gridDim.x - 1. Until then, all the other blocks are supposed to wait in the while loop.
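The wrap-around this relies on can be checked in isolation. The sketch below is only an illustration of the documented `atomicInc` rule (it stores `(old >= limit) ? 0 : old + 1` and returns `old`); `demoCounter`, `traceAtomicInc`, and `atomicIncTrace` are hypothetical names, and a single thread stands in for the one-increment-per-block pattern.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical counter, kept separate from syncCounter.
__device__ unsigned int demoCounter = 0;

// A single thread calls atomicInc `steps` times and records the value
// each call returns, i.e. the counter's value just before that call.
__global__ void traceAtomicInc(unsigned int limit, int steps, unsigned int* trace)
{
    demoCounter = 0;
    for (int i = 0; i < steps; ++i) {
        // Stores (old >= limit) ? 0 : old + 1 and returns old.
        trace[i] = atomicInc(&demoCounter, limit);
    }
}

// Hypothetical host helper: runs the trace and copies it back.
std::vector<unsigned int> atomicIncTrace(unsigned int limit, int steps)
{
    unsigned int* d = nullptr;
    cudaMalloc(&d, steps * sizeof(unsigned int));
    traceAtomicInc<<<1, 1>>>(limit, steps, d);

    std::vector<unsigned int> h(steps);
    cudaMemcpy(h.data(), d, steps * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With a limit of 3 (the value of `gridDim.x - 1` for a 4-block grid), five calls return 0, 1, 2, 3, 0: the fourth call is the one that stores 0 again. With a very large limit the returned values just keep counting up, so the counter does not go back to 0.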
The problem is that even if the second argument of atomicInc() is an extremely large number, so that syncCounter is never set back to 0, the blocks still come out of the loop and the program doesn't hang, so I don't know whether the global barrier is actually happening. I can't use deviceemu mode to debug, as it executes all blocks serially, which causes the program to hang.
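One way to find out whether a barrier is really holding, without deviceemu, might be a small check like the one below. It is a sketch under assumptions, not a diagnosis: `barrierOnce` is a stand-in single-use barrier (a counter that only grows), not the `syncBlocks()` above, and `arrived`, `checkBarrier`, and `barrierHolds` are hypothetical names. It also assumes every block is resident on the GPU at the same time; otherwise the spin loop never finishes. The host side checks the launch status as well, since a kernel that was aborted can look like one that "didn't hang".

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical arrival counter; only grows, so valid for one barrier per launch.
__device__ unsigned int arrived = 0;

// Stand-in single-use barrier (assumes all blocks are co-resident).
__device__ void barrierOnce()
{
    __syncthreads();                      // whole block reached the barrier
    if (threadIdx.x == 0) {
        __threadfence();                  // publish earlier writes before arriving
        atomicAdd(&arrived, 1u);
        volatile unsigned int* p = &arrived;
        while (*p < gridDim.x) { }        // spin until every block has arrived
        __threadfence();
    }
    __syncthreads();                      // release the rest of the block
}

// Each block sets its flag before the barrier and counts all flags after it.
__global__ void checkBarrier(unsigned int* flags, unsigned int* seen)
{
    if (threadIdx.x == 0) flags[blockIdx.x] = 1;
    barrierOnce();
    if (threadIdx.x == 0) {
        volatile unsigned int* f = flags;
        unsigned int count = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) count += f[b];
        seen[blockIdx.x] = count;         // equals gridDim.x if the barrier held
    }
}

// Returns true only if the kernel ran without error and every block
// saw every other block's flag after the barrier.
bool barrierHolds(int numBlocks, int threads)
{
    unsigned int zero = 0;
    cudaMemcpyToSymbol(arrived, &zero, sizeof(zero));

    unsigned int *flags = nullptr, *seen = nullptr;
    size_t bytes = numBlocks * sizeof(unsigned int);
    cudaMalloc(&flags, bytes);
    cudaMalloc(&seen, bytes);
    cudaMemset(flags, 0, bytes);
    cudaMemset(seen, 0, bytes);

    checkBarrier<<<numBlocks, threads>>>(flags, seen);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;

    std::vector<unsigned int> h(numBlocks, 0u);
    cudaMemcpy(h.data(), seen, bytes, cudaMemcpyDeviceToHost);
    cudaFree(flags);
    cudaFree(seen);

    for (int b = 0; b < numBlocks; ++b)
        ok = ok && (h[b] == static_cast<unsigned int>(numBlocks));
    return ok;
}
```

Swapping `barrierOnce()` for the barrier under test and calling `barrierHolds(numBlocks, threads)` would then report whether any block got past the barrier before all flags were written, or whether the launch itself failed.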
Help!!
–
sid