A global barrier for blocks: the barrier is failing...

Well, I am running a kernel with only as many blocks as there are multiprocessors. These blocks run in a loop to cover all the data elements. I am doing this because each block uses almost all of the shared memory.

I need to synchronize all blocks, i.e. have a global barrier, since every block needs data that every other block has written to global memory.

I wrote the following code for the global barrier:

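__device__ unsigned int syncCounter = 0;

__device__ void syncBlocks()
{
    if (threadIdx.x == 0)
    {
        // atomicInc wraps the counter back to 0 once the old value
        // reaches gridDim.x - 1, i.e. on the gridDim.x-th call.
        atomicInc(&syncCounter, gridDim.x - 1);
        __threadfence();
    }
    __syncthreads();

    // Spin until the last block to arrive wraps the counter to 0.
    volatile unsigned int* counter = &syncCounter;
    do {
    } while (*counter > 0);
}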

According to the code, the block that executes this function last is supposed to set syncCounter, which is a global variable, back to 0. Until then, all blocks are supposed to spin in the while loop.

The problem is that even if the second argument of atomicInc() is an extremely large number, so that syncCounter is never set back to 0, the blocks still come out of the loop and the program doesn't hang, so I don't know whether the global barrier is happening or not. I can't use deviceemu mode to debug, as it executes all blocks serially, which causes the program to hang.

help!!


sid

Speaking from experience, you should not attempt to go down this path.

Seriously. Don’t.

There is an implicit barrier between kernel launches. Just call the same kernel multiple times.
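As a rough illustration of that suggestion, here is a minimal sketch in which each step is its own launch. The names stepKernel and runSteps and the neighbour-sum computation are made up for the example and are not from the original post. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-step kernel: each element becomes the sum of itself and
// its right neighbour (wrapping around), so every step reads values that
// other blocks wrote during the previous step.
__global__ void stepKernel(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + in[(i + 1) % n];
}

// Hypothetical host helper: runs `steps` launches back to back and returns
// the final array. Assumes a non-empty input.
std::vector<int> runSteps(std::vector<int> host, int steps)
{
    int n = static_cast<int>(host.size());
    int* a = nullptr;
    int* b = nullptr;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));
    cudaMemcpy(a, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        // Launches issued to the same stream run in order, so this launch
        // sees everything the previous one wrote to global memory.
        stepKernel<<<blocks, threads>>>(a, b, n);
        int* t = a; a = b; b = t;   // ping-pong the two buffers
    }

    // The blocking copy also waits for the last launch to finish.
    cudaMemcpy(host.data(), a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return host;
}
```

Calling runSteps on an array of ones with steps = 3 should double every element three times. Note that shared memory does not survive between launches, so anything a later step needs has to be written to global memory first.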

I wish I could do multiple kernel launches, but the problem is that my entire data is in shared memory, and to write it to the global memory array I need to compute an offset for each block. To compute this offset I need data computed by other blocks. So if I quit the kernel, my data in shared memory will be lost. Any way out?

Seems like you need to figure out how to get out of that catch-22.
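One possible way out, on toolkits and GPUs much newer than the ones discussed in this thread, is the grid-wide barrier from cooperative groups (CUDA 9 or later, compute capability 6.0 or later, launched with cudaLaunchCooperativeKernel). The sketch below is only an illustration under those assumptions. The names packKernel and packBlocks, the per-block item counts, and the dummy data are all hypothetical. Each block keeps its results in shared memory, publishes its count, waits at the barrier, and then computes its own offset from the counts of the lower-numbered blocks.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: block b produces (b % 3) + 1 items in shared memory,
// then all blocks pack their items contiguously into `out`.
__global__ void packKernel(int* counts, int* out)
{
    __shared__ int items[4];
    cg::grid_group grid = cg::this_grid();

    int myCount = (blockIdx.x % 3) + 1;
    if (threadIdx.x < myCount)
        items[threadIdx.x] = blockIdx.x;    // stand-in for real results
    if (threadIdx.x == 0)
        counts[blockIdx.x] = myCount;       // publish this block's count

    grid.sync();    // grid-wide barrier; shared memory stays alive across it

    // Offset of this block = sum of the counts of all lower-numbered blocks.
    int offset = 0;
    for (int b = 0; b < blockIdx.x; ++b)
        offset += counts[b];

    if (threadIdx.x < myCount)
        out[offset + threadIdx.x] = items[threadIdx.x];
}

// Hypothetical host helper: returns the packed array, or an empty vector if
// the cooperative launch fails (for example, on an unsupported device).
std::vector<int> packBlocks(int blocks)
{
    int total = 0;
    for (int b = 0; b < blocks; ++b)
        total += (b % 3) + 1;

    int* counts = nullptr;
    int* out = nullptr;
    cudaMalloc(&counts, blocks * sizeof(int));
    cudaMalloc(&out, total * sizeof(int));

    void* args[] = { &counts, &out };
    // A cooperative launch only succeeds if all blocks can be resident at once.
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)packKernel, dim3(blocks), dim3(32), args, 0, 0);

    std::vector<int> host;
    if (err == cudaSuccess && cudaDeviceSynchronize() == cudaSuccess) {
        host.resize(total);
        cudaMemcpy(host.data(), out, total * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(counts);
    cudaFree(out);
    return host;
}
```

This matches the setup in the question, where the grid has no more blocks than the device can keep resident at the same time. Older toolkits may also need relocatable device code (-rdc=true) for grid.sync() to compile.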