Synchronization in nested CUDA kernel invocations

I also asked this question on stackoverflow. I hope this is not a problem. Just let me know if it is!

According to a book about CUDA programming, the following code doesn’t need any explicit synchronization to work correctly.

Note that the code computes a reduction, but thread blocks don’t interact with one another. Each thread block computes a partial reduction and then the host (CPU) computes the final reduction.

__global__ void gpuRecursiveReduceNosync (int *g_idata, int *g_odata,
        unsigned int isize)
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
        g_odata[blockIdx.x] = idata[0] + idata[1];

    // nested invoke
    int istride = isize >> 1;

    if(istride > 1 && tid < istride)
        idata[tid] += idata[tid + istride];

        if(tid == 0)
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);

Wouldn’t it be possible for a child thread to use data that isn’t available yet?

The book offers the following reason for not having any explicit synchronization:

When a child grid is invoked, its view of memory is fully consistent with the parent thread. Because each child thread only needs its parent’s values to conduct the partial reduction, the in-block synchronization performed before the child grids are launched is unnecessary.

What I know is that if the parent writes something before launching a child grid, then the child grid sees that modification. So, because of SIMT, the kernel above would certainly work if the thread block was small enough to fit within a single warp. But we can’t make that assumption here.

Let’s focus on just one thread block since they’re completely independent anyway. Let’s say blockDim.x is 128 and there’s enough data (128 integers) to “fully” use it (not quite “fully” since we only use half of it). The threads with id from 0 to 63 will do, in numpy-like syntax, idata[0:63] += idata[64:127]. This work will be split into two: one warp performs idata[0:31] += idata[64:95] and another one idata[32:63] += idata[96:127].

istride is 128/2 = 64, so thread 0 calls

gpuRecursiveReduceNosync<<<1, 64>>>(idata, odata, 64)

and the one-block child grid starts working on idata[0:63]. But what happens if idata[32:63] is not ready yet because the child grid has been created after the warp with the 0 thread has done its job, but before the other warp has computed the rest of the data?

A __syncthread() before the nested invocation would solve this problem.