Hi all,
I am new to GPU programming. I am reading the book “Professional CUDA C Programming” and got confused about parent/child memory consistency in dynamic parallelism.
The book gives an example that implements a parallel reduction using dynamic parallelism.
The main function:
int main(int argc, char **argv)
{
    CHECK(cudaSetDevice(0));
    int nblock  = 2048;
    int nthread = 512;             // initial block size
    int size = nblock * nthread;   // total number of elements to reduce
    dim3 block(nthread, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);
    // allocate host memory
    size_t bytes = size * sizeof(int);
    int *h_idata = (int *) malloc(bytes);
    int *h_odata = (int *) malloc(grid.x * sizeof(int));
    // initialize the array (the random values are overwritten with 1s,
    // so the expected sum is simply `size`)
    for (int i = 0; i < size; i++)
    {
        h_idata[i] = (int)(rand() & 0xFF);
        h_idata[i] = 1;
    }
    // allocate device memory
    int *d_idata = NULL;
    int *d_odata = NULL;
    CHECK(cudaMalloc((void **) &d_idata, bytes));
    CHECK(cudaMalloc((void **) &d_odata, grid.x * sizeof(int)));
    CHECK(cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice));
    gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2, block.x);
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost));
    int gpu_sum = 0;
    for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
    // free memory
    free(h_idata);
    free(h_odata);
    CHECK(cudaFree(d_idata));
    CHECK(cudaFree(d_odata));
    // reset device
    CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}
The gpuRecursiveReduce2() kernel is:
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;
    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }
}
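If I trace the recursion myself (my own reading of the code, for nblock = 2048 and nthread = 512, so this is not from the book), the chain of nested launches should be:

    host   : gpuRecursiveReduce2<<<2048, 256>>>(d_idata, d_odata, 256, 512)
    depth 1: gpuRecursiveReduce2<<<2048, 128>>>(d_idata, d_odata, 128, 512)
    depth 2: gpuRecursiveReduce2<<<2048,  64>>>(d_idata, d_odata,  64, 512)
    ...
    depth 8: gpuRecursiveReduce2<<<2048,   1>>>(d_idata, d_odata,   1, 512)  // iStride == 1: stop condition writes g_odata

Each level halves the block size while keeping all 2048 blocks, until a single thread per block writes the final per-block sum.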
The child-grid launch strategy of gpuRecursiveReduce2 is to have only the first thread of the first block create the child grid (if threadIdx.x == 0 && blockIdx.x == 0).
The threads of the child grid then read data that were computed by threads in all of the parent grid's thread blocks. Since all the data live in global memory, the child grid can certainly address them.
But does that mean the child grid is guaranteed to see the values already computed by all the threads of all the thread blocks in the parent grid, not just by the launching thread?
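To make the dependency concrete, here is one instance from the first child grid (indices worked out by me from the launch parameters above, so treat them as illustrative):

    // First child grid: iStride = 128, iDim = 512, gridDim.x = 2048.
    // Child block 1, thread 0 executes:
    //     idata = g_idata + 1 * 512;
    //     idata[0] += idata[128];
    // Both idata[0] and idata[128] were last written by PARENT block 1
    // (its threads 0 and 128), not by the launching thread (block 0, thread 0).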
I checked the CUDA Programming Guide section on dynamic parallelism. It says: "Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, …, data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child)."
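For context, the quote refers to an example in the guide along these lines (reproduced from memory, so the exact listing may differ slightly):

__global__ void child_launch(int *data)
{
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data)
{
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();    // makes the whole block's writes visible to the child
    if (threadIdx.x == 0)
    {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();    // device-side sync, as in the guide's example
    }
    __syncthreads();
}

Note that the parent here runs as a single block, so a __syncthreads() before the launch is enough to make every parent thread's write visible to the child.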
According to this quote, without a __syncthreads() the child grid is not even guaranteed to see data computed by the other threads of the launching thread's own block, and gpuRecursiveReduce2 contains no __syncthreads() at all. It therefore certainly cannot be guaranteed to see data computed by threads of other thread blocks.
But the example seems to assume exactly that: the child grid sees the data computed by all the threads in the parent grid, and all of those computations have completed before the child reads them.
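In fact, even if I inserted a __syncthreads() before the nested launch myself (it is not in the book's kernel), my understanding of the quote is that it would only cover the launching block:

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    __syncthreads();    // hypothetical: my addition, not in the book's code
    // per the guide, this should make only block 0's writes visible to the
    // child, because only thread 0 of block 0 performs the launch
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }

So I do not see what guarantees that the writes of the other 2047 parent blocks are visible to the child grid.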
Thanks a lot.
Shanshan