Synchronization in nested CUDA kernel invocations

kiuhnm03 · July 30, 2020, 3:52pm

I also asked this question on stackoverflow. I hope this is not a problem. Just let me know if it is!

According to a book about CUDA programming, the following code doesn’t need any explicit synchronization to work correctly.

Note that the code computes a reduction, but thread blocks don’t interact with one another. Each thread block computes a partial reduction and then the host (CPU) computes the final reduction.

__global__ void gpuRecursiveReduceNosync (int *g_idata, int *g_odata,
        unsigned int isize)
{
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invoke
    int istride = isize >> 1;

    if(istride > 1 && tid < istride)
    {
        idata[tid] += idata[tid + istride];

        if(tid == 0)
        {
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);
        }
    }
}

Wouldn’t it be possible for a child thread to use data that isn’t available yet?

The book offers the following reason for not having any explicit synchronization:

When a child grid is invoked, its view of memory is fully consistent with the parent thread. Because each child thread only needs its parent’s values to conduct the partial reduction, the in-block synchronization performed before the child grids are launched is unnecessary.

What I know is that if the parent writes something before launching a child grid, then the child grid sees that modification. So, because of SIMT, the kernel above would certainly work if the thread block was small enough to fit within a single warp. But we can’t make that assumption here.

Let’s focus on just one thread block, since the blocks are completely independent anyway. Say blockDim.x is 128 and there is enough data (128 integers) to occupy the whole block (although only half of the threads actually do any work). The threads with IDs 0 to 63 will do, in numpy-like syntax with inclusive ranges, idata[0:63] += idata[64:127]. This work is split across two warps: one performs idata[0:31] += idata[64:95] and the other performs idata[32:63] += idata[96:127].

istride is 128/2 = 64, so thread 0 calls

gpuRecursiveReduceNosync<<<1, 64>>>(idata, odata, 64)

and the one-block child grid starts working on idata[0:63]. But what happens if idata[32:63] is not ready yet, because the child grid was launched after the warp containing thread 0 finished its work but before the other warp computed the rest of the data?
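
To make the two-warp split above concrete, here is a minimal sketch. The kernel recordWarp is a hypothetical helper, not code from the book: each thread that would execute idata[tid] += idata[tid + istride] records the warp it runs in, and the host prints the inclusive index ranges per warp. On hardware with 32-thread warps this should reproduce the split described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the book): for each thread that would execute
// idata[tid] += idata[tid + istride], record the warp it runs in.
__global__ void recordWarp(int *warpOf, unsigned int isize)
{
    unsigned int tid = threadIdx.x;
    unsigned int istride = isize >> 1;

    if (tid < istride)
        warpOf[tid] = tid / warpSize;
}

int main()
{
    constexpr unsigned int isize = 128, istride = isize / 2;
    int h_warpOf[istride];
    int *d_warpOf = nullptr;
    cudaMalloc(&d_warpOf, istride * sizeof(int));

    recordWarp<<<1, isize>>>(d_warpOf, isize);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_warpOf, d_warpOf, istride * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_warpOf);

    // Print one line per contiguous run of elements updated by the same warp
    // (ranges are inclusive, as in the text above).
    unsigned int first = 0;
    for (unsigned int tid = 1; tid <= istride; ++tid)
    {
        if (tid == istride || h_warpOf[tid] != h_warpOf[first])
        {
            printf("warp %d: idata[%u..%u] += idata[%u..%u]\n",
                   h_warpOf[first], first, tid - 1,
                   first + istride, tid - 1 + istride);
            first = tid;
        }
    }
    return 0;
}
```

Nothing in the original kernel orders these two warps relative to the launch issued by thread 0, which is what the question is about.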

A __syncthreads() before the nested invocation would solve this problem.
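
A minimal sketch of that fix, under some assumptions: the kernel name gpuRecursiveReduceSynced and the host driver are made up for illustration, and the stop condition is restructured so that all threads of a block return together rather than only thread 0. That keeps the barrier outside any divergent branch. It also assumes isize is a power of two, as in the 128-element example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant of the kernel from the question with a block-wide barrier before
// the nested launch. The name gpuRecursiveReduceSynced is hypothetical.
__global__ void gpuRecursiveReduceSynced(int *g_idata, int *g_odata,
                                         unsigned int isize)
{
    unsigned int tid = threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // Stop condition. isize is uniform across the block, so every thread
    // returns here together and none is left waiting at the barrier below.
    if (isize == 2)
    {
        if (tid == 0)
            *odata = idata[0] + idata[1];
        return;
    }

    unsigned int istride = isize >> 1;

    if (istride > 1 && tid < istride)
        idata[tid] += idata[tid + istride];

    // Every warp finishes its partial sums before thread 0 launches the child.
    __syncthreads();

    if (istride > 1 && tid == 0)
        gpuRecursiveReduceSynced<<<1, istride>>>(idata, odata, istride);
}

int main()
{
    constexpr int blocks = 2, threads = 128, n = blocks * threads;
    int h_in[n], h_out[blocks];
    for (int i = 0; i < n; ++i)
        h_in[i] = 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    gpuRecursiveReduceSynced<<<blocks, threads>>>(d_in, d_out, threads);
    // Host-side wait; a parent grid is not complete until its child grids are.
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Final reduction on the host, as described in the question.
    int total = 0;
    for (int b = 0; b < blocks; ++b)
        total += h_out[b];
    printf("total = %d (expected %d)\n", total, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, for example nvcc -rdc=true file.cu -lcudadevrt, on a device of compute capability 3.5 or higher. As far as I can tell, this write, __syncthreads(), launch-from-thread-0 pattern is the same one the CUDA Programming Guide uses in its dynamic parallelism section when a child grid must see data written by several threads of the parent block.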

147735474 · April 13, 2023, 2:48am

I have the same question. Did you ever figure it out? @kiuhnm03

Robert_Crovella · April 13, 2023, 3:40am

this may be of interest

147735474 · April 14, 2023, 2:04am

The link is useful, thanks @Robert_Crovella
