Dynamic Parallelism parent and child memory consistency

Hi all,

I am a newbie on GPU programming.
I am reading the book “Professional CUDA C Programming” and got confused about parent and child memory consistency with dynamic parallelism.

It gives an example that implements a parallel reduction using dynamic parallelism.

The main function:

int main(int argc, char **argv)
{
    int nblock  = 2048;
    int nthread = 512;   // initial block size

    int size = nblock * nthread; // total number of elements to reduce
    dim3 block (nthread, 1);
    dim3 grid  ((size + block.x - 1) / block.x, 1);

    // allocate host memory
    size_t bytes = size * sizeof(int);
    int *h_idata = (int *) malloc(bytes);
    int *h_odata = (int *) malloc(grid.x * sizeof(int));

    // initialize the array
    for (int i = 0; i < size; i++)
    {
        h_idata[i] = (int)( rand() & 0xFF );
        h_idata[i] = 1;   // overwrite with 1 so the expected sum is simply `size`
    }

    // allocate device memory
    int *d_idata = NULL;
    int *d_odata = NULL;
    cudaMalloc((void **) &d_idata, bytes);
    cudaMalloc((void **) &d_odata, grid.x * sizeof(int));

    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
    gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2, block.x);
    cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);

    int gpu_sum = 0;
    for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];

    // free memory
    free(h_idata);
    free(h_odata);
    cudaFree(d_idata);
    cudaFree(d_odata);

    // reset device
    cudaDeviceReset();

    return EXIT_SUCCESS;
}

The gpuRecursiveReduce2() kernel is:

__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    // convert the global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;

    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in-place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                iStride / 2, iDim);
    }
}

The child-grid launch strategy of gpuRecursiveReduce2 is to have only the first thread of the first block create the child grid (if (threadIdx.x == 0 && blockIdx.x == 0)).

The threads of the child grid then access the data computed by the threads of all the different thread blocks of the parent grid. Since all the data reside in global memory, the child grid can reach them by address.

But does that mean the child grid is guaranteed to see all the data already computed by all the threads of the different thread blocks in the parent grid?

I checked the CUDA Programming Guide section on dynamic parallelism. It says: “Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, …, data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).”
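For context, the passage refers to an example along these lines (reconstructed from memory of the guide, so treat it as a sketch rather than a verbatim quote):

```cuda
__global__ void child_launch(int *data)
{
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data)
{
    // every thread of the parent block writes one element
    data[threadIdx.x] = threadIdx.x;

    // make all 256 writes visible to thread 0 before it launches the child;
    // without this barrier, only thread 0's own write (data[0]) is guaranteed
    // to be visible to the child grid
    __syncthreads();

    if (threadIdx.x == 0)
    {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();
    }

    __syncthreads();
}
```

The key point the guide makes is that the child inherits only the memory view of the single launching thread, so the parent must synchronize its block first if the child is supposed to see the other threads’ writes.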

From that quote, without synchronization the child grid is not even guaranteed to see the data computed by the other threads of the same thread block as the launching thread. So it is certainly not guaranteed to see the data computed by threads of other thread blocks.

But the example behaves as if the child grid could see the data computed by all the threads of the parent grid, and as if all of those computations were guaranteed to have completed.

Thanks a lot.

I think someone else believes so too:


I haven’t carefully analyzed the code behavior, but it appears to me that a __syncthreads() should be inserted between lines 14 and 17 of the gpuRecursiveReduce2 code you have posted.
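If I understand txbob’s suggestion correctly, the barrier would go between the in-place reduction and the child launch, roughly like this (my own sketch, not tested):

```cuda
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    int *idata = g_idata + blockIdx.x * iDim;

    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in-place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // make this block's partial sums visible to the launching thread before
    // the child launch; note that __syncthreads() only synchronizes threads
    // within one block, so the writes of the *other* blocks are still not
    // guaranteed to be visible to the child grid
    __syncthreads();

    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                iStride / 2, iDim);
    }
}
```

Even with the barrier, this only orders the writes of block 0 relative to the launch, which is why the cross-block behaviour of the book’s example still looks questionable to me.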

Hi txbob,
Thanks a lot. I think the example is wrong. :)