Block Synchronization

Hello,

I have implemented the following function for inter-block synchronization (to sync threads across different blocks).
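__device__ void inter_block_barrier(unsigned int *count)
{
    int value;
    // Only thread (0,0) of each block touches the grid-wide counter.
    if ((threadIdx.x == 0) && (threadIdx.y == 0))
    {
        value = atomicInc(count, gridDim.x * gridDim.y);
        // Spin until the counter reads zero again.
        while (count[0] != 0);
    }

    // Hold the rest of the block until thread (0,0) leaves the loop.
    __syncthreads();
}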
I am using 20,000 blocks in my code, each with 64 threads. My device is a GT200. This function does not seem to synchronize the blocks. Can anyone see a flaw in the logic?
The variable count[0] is set to zero before the function is called (though ideally that should not be needed).

Thanks and regards,
Nachiket

If you have 20,000 blocks, most of the blocks are not running simultaneously. The scheduler loads as many blocks as resources allow onto the multiprocessors, lets them run to completion, then runs more blocks. There is no context switching of blocks in and out of multiprocessors, which is the only way your solution would work.
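To get a feel for how far 20,000 blocks is from what can actually be resident at once, you can ask the runtime. The following is only a rough sketch: it assumes a CUDA toolkit recent enough to provide the occupancy API (which postdates GT200-era toolkits), and dummy_kernel and max_resident_blocks are hypothetical names standing in for your own kernel and helper. On a compute capability 1.x part the limit is 8 resident blocks per multiprocessor, so a 30-multiprocessor GT200 board would hold at most 240 blocks at a time.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel. Occupancy depends on the
// kernel's register and shared memory usage, so pass your own kernel
// to the query below to get a meaningful number.
__global__ void dummy_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

// Returns an upper bound on the number of blocks of dummy_kernel that can
// be resident on the current device at the same time, or -1 on error.
int max_resident_blocks(int threads_per_block)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, dummy_kernel, threads_per_block, 0) != cudaSuccess)
        return -1;

    // Resident blocks per multiprocessor times number of multiprocessors.
    return blocks_per_sm * prop.multiProcessorCount;
}
```

Calling max_resident_blocks(64) gives the number to compare against your grid size. Any block beyond that count cannot start until an earlier block has finished, so a barrier that spins waiting for all blocks to arrive can never be released.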

If you need to synchronize the blocks in your kernel, you should write two kernel calls and queue them up. Kernel completion is the only officially recommended synchronization barrier across blocks. (And the overhead is not huge, so unless your kernel is very short, it is often not a huge problem.)
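A minimal sketch of the two-kernel approach, assuming the work splits cleanly at the barrier. The kernels phase1 and phase2, the helper run_two_phase, and the arithmetic they perform are all made up for illustration; the point is only that phase2 reads values written by other blocks in phase1, which is safe because kernels launched into the same stream run in order.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical work before the barrier: each thread writes one element.
__global__ void phase1(const float *in, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * in[i];
}

// Hypothetical work after the barrier: each thread reads an element that
// was, in general, written by a different block during phase1.
__global__ void phase2(const float *tmp, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[n - 1 - i] + 1.0f;
}

// Host helper: runs both phases back to back and returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_two_phase(const std::vector<float> &host_in)
{
    int n = (int)host_in.size();
    std::vector<float> host_out(n);
    if (n == 0) return host_out;

    size_t bytes = n * sizeof(float);
    float *in = 0, *tmp = 0, *out = 0;
    cudaMalloc(&in, bytes);
    cudaMalloc(&tmp, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 64;                          // same block size as in the question
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements

    // Both launches go to the default stream, so phase2 does not start
    // until every block of phase1 has finished. That ordering is the barrier.
    phase1<<<blocks, threads>>>(in, tmp, n);
    phase2<<<blocks, threads>>>(tmp, out, n);

    // This blocking copy also waits for phase2 to complete.
    cudaMemcpy(host_out.data(), out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(in);
    cudaFree(tmp);
    cudaFree(out);
    return host_out;
}
```

Both launches are queued without any host-side synchronization between them, so the only extra cost over a single kernel is the second launch itself.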