You can use the threadfence functions to obtain some kind of synchronization across blocks. In practice the threadfence function locks an address in main memory. The programming guide has an example of how to use a threadfence function in a reduction code.
The threadfence functions are memory barriers, not synchronization functions in any form. All they do is force memory contents to be flushed up the memory hierarchy far enough to guarantee visibility at the requested level (block, grid, or host). They do not lock memory locations. Race conditions are still possible.
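To illustrate what a fence does (and does not) guarantee, here is a minimal producer sketch; all names are made up, and on older hardware the reader side additionally needs volatile or atomics to be safe:

```cuda
__device__ int result;              // data produced by one thread
__device__ volatile int ready = 0;  // publication flag (volatile so readers re-load it)

__global__ void producer()          // hypothetical kernel, for illustration only
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = 42;        // (1) write the data
        __threadfence();    // (2) fence: (1) becomes visible device-wide before (3)
        ready = 1;          // (3) publish the flag
    }
    // Any thread that later observes ready == 1 is guaranteed to observe
    // result == 42 as well. That ordering is all the fence provides:
    // nothing here makes other blocks wait, and two blocks writing the
    // same locations can still race.
}
```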
Global barriers can be hacked together using atomic functions to implement a semaphore in device memory, but they tend to be dangerous because you have no guarantee that all of your blocks are running simultaneously, and blocks waiting at the barrier are not going to be preempted so the non-running blocks can make progress. (I’m reasonably sure that hardware before compute capability 3.5 actually can’t preempt blocks ever.) That will create a deadlock.
You can deliberately limit the number of blocks to be equal to the number of multiprocessors on the device, which will pretty much ensure all blocks are running (but again, the CUDA runtime does not guarantee this). Even then, I would never use an improvised global barrier in production code.
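Querying the multiprocessor count at runtime is straightforward with the CUDA runtime API (the kernel name here is a placeholder):

```cuda
int device = 0, numSMs = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
// One block per multiprocessor: likely, but not guaranteed, to be co-resident.
myKernel<<<numSMs, 256>>>(/* ... */);
```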
Also, launching kernels from inside kernels is a compute capability 3.5 feature, not a compute capability 3.0 feature.
Why not just implement atomic counters? Either global, or by block?
If the counter value equals your total thread count, then you know all threads have reached this point in execution, and you can have that last thread do your summation.
If the work is too much for one thread, then you could maintain the atomic counter per block (either via __syncthreads() within the block, or, preferably, a per-block atomic counter feeding a global atomic counter). If the global counter equals your block count, then you know all blocks have reached this point, and you can do your summation in the last block.
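A sketch of the "last block finishes the job" idea, essentially the threadfence-reduction pattern from the programming guide; the kernel and buffer names are made up:

```cuda
__device__ unsigned int blocksDone = 0;   // must start at zero

__global__ void reduceAndFinish(const float *partial, float *out,
                                unsigned int numBlocks)
{
    __shared__ bool isLast;

    /* ... each block computes and stores its result in partial[blockIdx.x] ... */

    __syncthreads();           // whole block is done writing its partial result
    if (threadIdx.x == 0) {
        __threadfence();       // make this block's result visible grid-wide
        // atomicInc returns the old value; it wraps to 0 once it reaches numBlocks
        unsigned int ticket = atomicInc(&blocksDone, numBlocks);
        isLast = (ticket == numBlocks - 1);   // true only in the last block to arrive
    }
    __syncthreads();

    if (isLast) {
        // The last block alone sums all partial results.
        float sum = 0.0f;
        for (unsigned int b = threadIdx.x; b < numBlocks; b += blockDim.x)
            sum += partial[b];
        /* ... reduce 'sum' across this block's threads, thread 0 writes *out ... */
        if (threadIdx.x == 0)
            blocksDone = 0;    // reset so the kernel can be launched again
    }
}
```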
There is no robust way to do inter-block synchronization in the CUDA programming model, as blocks could even execute serially under that model, leading to deadlocks as described by seibert above.
The robust way to achieve the desired functionality is to launch two kernels, one for each stage of the two-stage reduction. While the second kernel often runs with very low efficiency in such a setup, it also tends to run very briefly, so that overall efficiency of the reduction is completely dominated by the more expensive first stage of the reduction.
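The two-kernel structure might look like this on the host side (kernel names are placeholders):

```cuda
// Stage 1: each block reduces its slice of d_in to one value in d_partial.
reducePartial<<<numBlocks, threadsPerBlock>>>(d_in, d_partial, n);
// Launches on the same stream execute in order, so the gap between the
// two launches acts as the grid-wide barrier.
// Stage 2: one block (or even one thread) reduces the numBlocks partial sums.
reduceFinal<<<1, threadsPerBlock>>>(d_partial, d_out, numBlocks);
```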
I agree with njuffa – the way to synchronize all blocks is to simply run two kernels. But I have found it useful to use atomicInc on a global variable to track when all threads have passed a certain point, and then do trivial cleanup operations with that last thread. This can be extended so that the last thread updates a flag in shared memory for its block. After the atomicInc and flag update, issue a __syncthreads() and then check the shared flag. In this way you can detect which thread or block is the last to execute, and then do cleanup. When I do this, I do it as the last step of a kernel: the remaining threads or blocks all complete the kernel and exit, and only the last to finish does some work.
As I write this, though, I realize that I always need to call a kernel ahead of time to initialize my globals to zero. So I have not saved a kernel launch, and have introduced a bunch of atomicInc’s. Probably would be better off with a primary kernel and a cleanup kernel.
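For what it's worth, the zero-initialization can also be done from the host without an extra kernel launch, for example (symbol name made up):

```cuda
__device__ unsigned int counter;   // the global the kernel atomically increments

// Host side, before each launch:
unsigned int zero = 0;
cudaMemcpyToSymbol(counter, &zero, sizeof(zero));   // reset without a kernel
```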
Hi, my problem seems to be a race condition, but I don't know what's wrong in my code. After I run the kernel and check the result on the host, the values differ somewhere in the result array, and I get a different answer every time. Here is my code, please help me :(
__shared__ int sharebuffer[BLOCK_SIZE];           // must be an array, not a scalar (BLOCK_SIZE = threads per block)
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;    // 'i' was never defined in the original
if (i < (*width) * (*height) * N_constant)
    sharebuffer[tid] = src[i] * coeff_constant[tid % N_constant];
__syncthreads();                                  // all loads must finish before the reduction reads neighbors
for (int stride = N_constant, shift = 1; stride > 1; stride >>= 1, shift++) {
    int ThreadCount = (stride / 2) * ((2 * tid) / stride) + tid;
    if (tid < (blockDim.x >> shift))
        sharebuffer[tid] = sharebuffer[ThreadCount] + sharebuffer[ThreadCount + stride / 2];
    __syncthreads();                              // one barrier per reduction step, outside the if
}
if (tid < (blockDim.x / N_constant))
    dst[tid + (blockDim.x / N_constant) * blockIdx.x] = (sharebuffer[tid] + offset_constant) >> shift_constant;