Synchronize all blocks in CUDA

cygnusc · December 26, 2012, 6:38pm

Hello - I think __syncthreads() only synchronizes all threads in the same block and I am looking for a way to synchronize all the blocks in a grid before moving on to the rest of the code.

I did some research and people suggested atomic operations. I am not familiar with it but I thought it can only provide I/O level synchronization. Any idea? :)

Mandar_Gurav · December 26, 2012, 6:49pm

Synchronization across all blocks can be done like this. Divide your kernel into two parts. Once all the blocks from first part is done - the next kernel(for part 2) will be executed. :)

– Mandar Gurav

cygnusc · December 26, 2012, 7:22pm

Thanks. Actually I’ve tried this. In my case the kernel has a loop in it, to iterate itself for millions of times.

After dividing the kernel into two, I had to move the loop outside the kernel, and use CPU to do the launch the kernel repeatedly.

It looks like the whole execution is then significantly slowed down, possibly due to overhead of launching a kernel for millions of times.

pasoleatis · December 26, 2012, 8:55pm

You can use the threadfence functions to obtain some kind of synchronization across blocks. In practice the threadfence function locks an address in the main memory. In the programming guide there is an example of how to use threadfence function for a reduction code.

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDEQFjAA&url=http%3A%2F%2Fstackoverflow.com%2Fquestions%2F11570789%2Fcuda-threadfence&ei=jWbbUPf7BofltQaVnYHYAQ&usg=AFQjCNF0O8UAqo8tY0XY27G9XYqigolbVA&bvm=bv.1355534169,d.Yms

The cc 3.0 support launching kernels from inside a kernel and spawning new threads.

seibert · December 27, 2012, 12:20am

The threadfence functions are memory barriers, not synchronization functions in any form. All they do is force memory contents to be flushed up the memory hierarchy far enough to guarantee visibility at the requested level (block, grid, or host). They do not lock memory locations. Race conditions are still possible.

Global barriers can be hacked together using atomic functions to implement a semaphore in device memory, but they tend to be dangerous because you have no guarantee that all of your blocks are running simultaneously, and blocks waiting at the barrier are not going to be preempted so the non-running blocks can make progress. (I’m reasonably sure that hardware before compute capability 3.5 actually can’t preempt blocks ever.) That will create a deadlock.

You can deliberately limit the number of blocks to be equal to the number of multiprocessors on the device, which will pretty much ensure all blocks are running (but again, the CUDA runtime does not guarantee this). Even then, I would never use an improvised global barrier in production code.

Also, launching kernels from inside kernels is a compute capability 3.5 feature, not a compute capability 3.0 feature.

pasoleatis · December 27, 2012, 2:06pm

Thanks for the clarifying. I needed a global barrier for some simple cases of reduction in 2 steps, where the last block to finish does the final summation.

rjl · December 31, 2012, 6:31pm

Why not just implement atomic counters? Either global, or by block?

If the counter value equals your total thread count, then you know all threads have reached this point in execution. And then have that last thread do your summation.

If the work is too much for one thread, then you could issue the atomic counter by block (either by syncthreads in a block, or, preferably, utilizing an atomic counter within each block, and then a global atomic counter). If the global counter equals your block count, then you know all blocks have reached this point, and you can do your summation in the last block.

njuffa · December 31, 2012, 9:07pm

There is no robust way to do inter-block synchronization in the CUDA programming model, as blocks could even execute serially under that model, leading to deadlocks as described by seibert above.

The robust way to achieve the desired functionality is to launch two kernels, one for each stage of the two-stage reduction. While the second kernel often runs with very low efficiency in such a setup, it also tends to run very briefly, so that overall efficiency of the reduction is completely dominated by the more expensive first stage of the reduction.

rjl · January 1, 2013, 1:24am

I agree with njuffa – the way to synchronize all blocks is to simply run two kernels. But I have found it useful to use atomicInc of a global variable to track when all threads have passed a certain point, and then do trivial cleanup operations with that last thread. This can be extended, such that the last thread can update a flag in shared memory for the block. After the atomicInc and flag update, issue a syncThreads and then check the shared flag. In this way you can detect which thread or block is the last to execute, and then do cleanup. When I do this, I do it as the last step of a kernel. The remaining threads or blocks all complete the kernel and exit. It is only the last to finish that does some work.

As I write this, though, I realize that I always need to call a kernel ahead of time to initialize my globals to zero. So I have not saved a kernel launch, and have introduced a bunch of atomicInc’s. Probably would be better off with a primary kernel and a cleanup kernel.

baby1120 · October 13, 2013, 5:37pm

Hi , I had same problem as original poster said, can anyone show a example code for solution please??

baby1120 · October 13, 2013, 5:41pm

Hi , I had same problem as original poster said, can anyone show a example code for solution please??

pasoleatis · October 13, 2013, 5:52pm

Hello,

Are your problem is the block synchronization? You can set a global uch as:

device int blockcounter;
and add in your kernel at the end:

__syncthreads();
atomicAdd(&blockcounter,1);

After the call you need another kernel with <<<1,1>>> and a printf command.

baby1120 · October 25, 2013, 1:57am

Hi, my problem seems “race condition”, I don’t know what’s wrong in my code.After I run kernel and I check result in the host, the result will be different in the somewhere of result array.Every time I got different answer.Following is my code, please help me :(

__shared__ int sharebuffer[512];
int i=blockDim.x*blockIdx.x+threadIdx.x;
int tid = threadIdx.x;
int Thread_Threshold;
int ThreadCount;
int index;
if(i < (*width)*(*height)*N_constant)
{
    sharebuffer[tid] = src[i]*coeff_constant[tid%N_constant];
    __syncthreads();
    for( int stride = N_constant, shift = 1; stride > 1; stride>>=1, shift++)
    {
      	ThreadCount = (stride/2)*((2*tid)/stride)+tid;
	if( tid < (blockDim.x >> shift))
		sharebuffer[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];
	__syncthreads();
    }
    if( tid <(blockDim.x/N_constant))
	dst[tid+(blockDim.x/N_constant)*blockIdx.x] = (sharebuffer[tid]+offset_constant) >> shift_constant;
}