CUDA block synchronization

I have b blocks, and each block has t threads. I can use

__syncthreads()

to synchronize the threads within a particular block. For example:

__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        // execute something
        __syncthreads();  // barrier for the threads of this block only
    }
}

But what I need is to synchronize all the threads across all the blocks. How can I do this?

CUDA provides no efficient, reliable inter-block synchronization except kernel launches themselves. If you split your calculation into two kernels and launch them in the same stream, you are guaranteed that all the blocks of the first kernel have finished, and their writes to device memory are visible, before the second kernel starts. The kernel launch overhead is low enough that you shouldn't worry about it unless your kernels take less than 5 or 10 microseconds.
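As a rough illustration of the two-kernel approach, here is a minimal sketch. The names phase1, phase2 and runTwoPhase are made up for this example and are not from the question. It assumes both kernels are launched in the default stream, so the second launch acts as the grid-wide barrier: phase2 reads values written by a different block in phase1, which is only safe because phase1 has completely finished by then.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Phase 1 (hypothetical): every thread writes one element.
__global__ void phase1(int *data, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) data[gid] = 2 * gid;
}

// Phase 2 (hypothetical): every thread reads an element that was written by
// the next block in phase 1 (wrapping around), so it depends on ALL blocks
// of phase 1 having finished.
__global__ void phase2(const int *data, int *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) out[gid] = data[(gid + blockDim.x) % n];
}

// Launches both phases with b blocks of t threads and checks the result.
// Returns true if every output element has the expected value.
bool runTwoPhase(int b, int t)
{
    int n = b * t;
    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(data); return false; }

    // Same (default) stream: phase2 does not start until phase1 is complete.
    phase1<<<b, t>>>(data, n);
    phase2<<<b, t>>>(data, out, n);

    // cudaMemcpy in the default stream waits for the preceding kernels.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * ((i + t) % n)) return false;
    return true;
}
```

Calling runTwoPhase(4, 64) from main would run both phases on 4 blocks of 64 threads. For the loop in the question, the same idea would mean moving the loop to the host and launching one kernel per iteration in place of the __syncthreads() barrier.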