Is dynamic parallelism suitable for this application?

I am writing some CUDA code for an application that first reads some data from global memory, does a simple calculation, and generates some data A. Data A (which is small) is then used for a second calculation that requires many more memory reads. Since the two steps are very different, they don’t work well in a single kernel: combining them drives up register usage. So I am thinking of splitting the kernel in two. But that would require the first kernel to write data A to global memory so that the second kernel can read it. With a single kernel, these extra writes and reads would not be needed.

I am wondering whether this application is suitable for dynamic parallelism (DP). I read the guide for DP but could not find implementation details on this point. When the parent launches the child, does the parent still reside on the SM and occupy resources? If yes, I don’t see how it would give any performance improvement. If not, does it incur some context-switch overhead?
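One detail worth noting for your use case: even with DP, a child kernel cannot access its parent’s registers, local memory, or shared memory, so the small data A would still have to be passed through global memory. A minimal sketch of the two-step split under DP (all kernel names and calculations here are invented placeholders, not your actual code; DP requires compute capability 3.5+ and compiling with -rdc=true):

```cuda
// Hypothetical sketch only.  Note that the child cannot see the parent's
// shared or local memory, so data A still lives in global memory.

__global__ void step2_child(const float *A, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = A[i % 32] * 2.0f;   // stand-in for the memory-heavy step
}

__global__ void step1_parent(const float *in, float *A, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        A[i] = in[i] + 1.0f;         // stand-in for the simple first step

    // Let a single thread launch the second step.  (Guaranteeing that the
    // whole of A is written before the child reads it needs more care,
    // e.g. launching from a separate kernel or synchronizing first.)
    if (i == 0)
        step2_child<<<(n + 255) / 256, 256>>>(A, out, n);
}
```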

I have minimal experience with dynamic parallelism (DP).

My understanding is that (correct me if I’m wrong) one limitation of DP is related to the maximum number of threads that can run concurrently on a GPU. For example, if each thread of a parent kernel running 100 threads launches a child kernel running 100 threads, you quickly end up with a large number of threads (10,000 in this case) on the device. So you somehow have to ensure that the child kernels run a properly limited number of threads.
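To make the multiplication concrete, a sketch (the kernel names are invented for illustration):

```cuda
__global__ void child(void)
{
    // 100 threads of work per launch
}

__global__ void parent(void)
{
    // If every thread of a 100-thread parent grid does this, you get
    // 100 x 100 = 10,000 child threads in flight:
    child<<<1, 100>>>();

    // A common way to bound the count is to let only one thread per
    // block (or per grid) perform the launch:
    // if (threadIdx.x == 0) child<<<1, 100>>>();
}
```

There is also a device-runtime limit on how many child launches can be queued at once, adjustable via cudaDeviceSetLimit with cudaLimitDevRuntimePendingLaunchCount.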

I would say that DP is useful when, for example, a thread has to run for loops computing independent results. In this case, DP adds a further degree of useful parallelism.

As an example, I have successfully used dynamic parallelism for an interpolation problem of the form:

int i = threadIdx.x + blockDim.x * blockIdx.x;

for (int m = 0; m < (2*K+1); m++) {

    PP1      = calculate_PP1(i, m);
    phi_cap1 = calculate_phi_cap1(i, m);

    for (int n = 0; n < (2*K+1); n++) {

        PP2      = calculate_PP2(i, n);
        phi_cap2 = calculate_phi_cap2(i, n);

        atomicAdd(&result[PP1][PP2], data[i] * phi_cap1 * phi_cap2);
    }
}

where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them into a (2K+1)x(2K+1) child kernel. Furthermore, the number of loop cycles, which corresponds to the number of threads in the child kernel, is limited to 13x13.
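Concretely, the loop nest above maps to a child kernel along these lines (a sketch, not my exact code; the calculate_* helpers are assumed to be __device__ functions, and result is flattened into a 1-D array with leading dimension ldr for simplicity):

```cuda
__global__ void interp_child(int i, const float *data, float *result, int ldr)
{
    int m = threadIdx.x;                       // 0 .. 2K
    int n = threadIdx.y;                       // 0 .. 2K

    int   PP1      = calculate_PP1(i, m);
    float phi_cap1 = calculate_phi_cap1(i, m);
    int   PP2      = calculate_PP2(i, n);
    float phi_cap2 = calculate_phi_cap2(i, n);

    atomicAdd(&result[PP1 * ldr + PP2], data[i] * phi_cap1 * phi_cap2);
}

// Each parent thread i then replaces its loop nest with a single launch:
//     interp_child<<<1, dim3(2*K + 1, 2*K + 1)>>>(i, data, result, ldr);
```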

Finally, of course, you have to account for the fact that launching child kernels costs some overhead.

If the parent launches the child kernel and then calls cudaDeviceSynchronize to wait for the children to complete, the parent kernel will likely be removed from the SM, freeing resources to guarantee that the children make forward progress. If the parent does not need to wait for the children to complete, the parent kernel will finish executing its device code, but the kernel itself will not be marked as completed until all of its children have completed.
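In code, the two behaviors look roughly like this (a sketch with invented kernel names; note that device-side cudaDeviceSynchronize() belongs to the legacy CDP model and was removed in CUDA 12):

```cuda
__global__ void child(void) { /* ... */ }

__global__ void parent_waits(void)
{
    child<<<1, 128>>>();
    // Parent blocks here until the child completes; while waiting, the
    // parent block may be swapped off the SM so the child can run.
    cudaDeviceSynchronize();   // device-side sync (legacy CDP only)
    // ... code that consumes the child's results ...
}

__global__ void parent_fire_and_forget(void)
{
    child<<<1, 128>>>();
    // No sync: this device code runs to completion, but the parent kernel
    // is only reported complete (to the host) once the child finishes too.
}
```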

Hi Greg,
Why does the parent have to call cudaDeviceSynchronize? Is it sufficient to call cudaStreamSynchronize instead?

The reason I ask: what if I have a few unrelated tasks running on different streams and using dynamic parallelism? Can I use a stream sync instead of holding up the other streams with cudaDeviceSynchronize?