Hi all,
I am new to GPU programming. I am reading the book “Professional CUDA C Programming” and got confused about parent/child memory consistency in dynamic parallelism.
The book gives an example that implements a parallel reduction using dynamic parallelism.
The main function:
int main(int argc, char **argv)
{
    CHECK(cudaSetDevice(0));
    int nblock  = 2048;
    int nthread = 512;             // initial block size
    int size = nblock * nthread;   // total number of elements to reduce
    dim3 block(nthread, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);
    // allocate host memory
    size_t bytes = size * sizeof(int);
    int *h_idata = (int *) malloc(bytes);
    int *h_odata = (int *) malloc(grid.x * sizeof(int));
    // initialize the array (the random values are overwritten with 1s,
    // so the expected sum is simply `size`)
    for (int i = 0; i < size; i++)
    {
        h_idata[i] = (int)(rand() & 0xFF);
        h_idata[i] = 1;
    }
    // allocate device memory
    int *d_idata = NULL;
    int *d_odata = NULL;
    CHECK(cudaMalloc((void **) &d_idata, bytes));
    CHECK(cudaMalloc((void **) &d_odata, grid.x * sizeof(int)));
    CHECK(cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice));
    gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2, block.x);
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost));
    int gpu_sum = 0;
    for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
    // free memory
    free(h_idata);
    free(h_odata);
    CHECK(cudaFree(d_idata));
    CHECK(cudaFree(d_odata));
    // reset device
    CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}
The gpuRecursiveReduce2() kernel is:
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;
    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }
}
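If I trace the recursion myself (my own reading of the code, for nblock = 2048 and nthread = 512, so this is not from the book), the chain of nested launches should be:

    host   : gpuRecursiveReduce2<<<2048, 256>>>(d_idata, d_odata, 256, 512)
    depth 1: gpuRecursiveReduce2<<<2048, 128>>>(d_idata, d_odata, 128, 512)
    depth 2: gpuRecursiveReduce2<<<2048,  64>>>(d_idata, d_odata,  64, 512)
    ...
    depth 8: gpuRecursiveReduce2<<<2048,   1>>>(d_idata, d_odata,   1, 512)  // iStride == 1: stop condition writes g_odata

Each level halves the block size while keeping all 2048 blocks, until a single thread per block writes the final per-block sum.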
The child-grid launch strategy of gpuRecursiveReduce2 is to have only the first thread of the first block create the child grid (if threadIdx.x == 0 && blockIdx.x == 0).
The threads of the child grid then read data that were computed by threads in all of the parent grid's thread blocks. Since all the data live in global memory, the child grid can certainly address them.
But does that mean the child grid is guaranteed to see the values already computed by all the threads of all the thread blocks in the parent grid, not just by the launching thread?
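To make the dependency concrete, here is one instance from the first child grid (indices worked out by me from the launch parameters above, so treat them as illustrative):

    // First child grid: iStride = 128, iDim = 512, gridDim.x = 2048.
    // Child block 1, thread 0 executes:
    //     idata = g_idata + 1 * 512;
    //     idata[0] += idata[128];
    // Both idata[0] and idata[128] were last written by PARENT block 1
    // (its threads 0 and 128), not by the launching thread (block 0, thread 0).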
I checked the CUDA Programming Guide section on dynamic parallelism. It says: "Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, …, data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child)."
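For context, the quote refers to an example in the guide along these lines (reproduced from memory, so the exact listing may differ slightly):

__global__ void child_launch(int *data)
{
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data)
{
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();    // makes the whole block's writes visible to the child
    if (threadIdx.x == 0)
    {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();    // device-side sync, as in the guide's example
    }
    __syncthreads();
}

Note that the parent here runs as a single block, so a __syncthreads() before the launch is enough to make every parent thread's write visible to the child.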
According to this quote, without a __syncthreads() the child grid is not even guaranteed to see data computed by the other threads of the launching thread's own block, and gpuRecursiveReduce2 contains no __syncthreads() at all. It therefore certainly cannot be guaranteed to see data computed by threads of other thread blocks.
But the example seems to assume exactly that: the child grid sees the data computed by all the threads in the parent grid, and all of those computations have completed before the child reads them.
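In fact, even if I inserted a __syncthreads() before the nested launch myself (it is not in the book's kernel), my understanding of the quote is that it would only cover the launching block:

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    __syncthreads();    // hypothetical: my addition, not in the book's code
    // per the guide, this should make only block 0's writes visible to the
    // child, because only thread 0 of block 0 performs the launch
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }

So I do not see what guarantees that the writes of the other 2047 parent blocks are visible to the child grid.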
Thanks a lot.
Shanshan