A question on nested parallelism

Hello everyone,

I’m a newbie at CUDA programming, but I need to use it in a complex project. I really need some help.

My question: if I want to execute a child kernel 256 times concurrently, how can I do that with Dynamic Parallelism?

I read a blog post, https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/, which says: “By default, grids launched within a thread block are executed sequentially: the next grid starts executing only after the previous one has finished. This happens even if grids are launched by different threads within the block.”
So my idea is to set the block size to (1,1) and the grid size to (256,1) for the parent kernel, so that 256 threads in different blocks each launch the child kernel concurrently. Will that be very inefficient? Is there a better solution?
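
For reference, the launch configuration described here might look like the minimal sketch below. The names `childKernel`, `parentKernel`, and `runCdpSketch`, the per-pair element count, and the trivial "add 1" workload are all placeholders for illustration, not part of the actual project. Child grids launched from different blocks are allowed to overlap, but CUDA does not guarantee that they will.

```cuda
// Build (assumed): nvcc -rdc=true cdp_sketch.cu -lcudadevrt
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical child kernel: adds 1 to every element of one row pair.
__global__ void childKernel(float *data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1.0f;
}

// Parent: one thread per block, so every child grid is launched
// from a different block, as proposed in the question.
__global__ void parentKernel(float *data, int perPair)
{
    float *mine = data + (size_t)blockIdx.x * perPair;
    childKernel<<<(perPair + 255) / 256, 256>>>(mine, perPair);
}

// Returns the number of elements the children failed to update, or -1 on error.
int runCdpSketch()
{
    const int pairs = 256, perPair = 2 * 1024;  // assumed: 2 rows of 1024 per pair
    const size_t n = (size_t)pairs * perPair;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d, 0, n * sizeof(float));

    parentKernel<<<pairs, 1>>>(d, perPair);     // grid (256,1), block (1,1)
    // The parent grid is not complete until all of its child grids have finished.
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(d); return -1; }

    std::vector<float> h(n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    int bad = 0;
    for (float v : h) if (v != 1.0f) ++bad;
    return bad;
}

int main()
{
    printf("elements not updated: %d\n", runCdpSketch());
    return 0;
}
```

Note that each parent block uses only one of the threads it could run, so the parent does nothing except issue 256 launches.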

Thank you for your help!

I wouldn’t ordinarily recommend CUDA dynamic parallelism to CUDA newbies.

Furthermore, there are a lot of questions already on public forums discussing various aspects of CUDA dynamic parallelism.

It’s impossible to discuss efficiency without knowing the workload and additional details you haven’t provided. The launch sequence does not address efficiency. It’s also not possible to describe the better solution with the problem description you’ve given so far (“a complex project”).

As Robert Crovella says, CUDA dynamic parallelism is an advanced topic not really suitable for self-proclaimed newbies.

Keep in mind that it has been found again and again that launching kernels from device code is not any more efficient than launching them from the host, i.e. both incur the same overhead. The only problems I am personally aware of that benefit from dynamic parallelism are those that require local flexibility based on data encountered at run-time, e.g. local re-meshing with a finer mesh to preserve accuracy when particular events happen or sudden changes to functional properties are encountered.

A real-life example is presented on the NVIDIA developer blog: https://devblogs.nvidia.com/a-cuda-dynamic-parallelism-case-study-panda/

Sorry, this is my first time asking a question on this forum. Let me be more specific.

I have a data matrix of size (512,1024). I need to perform an operation on every row pair, and I want this operation to be executed 256 times concurrently to save time. The operation consists of computing an FFT (or convolution), finding the index of the maximum value in the FFT result, and shifting the data in its original location. (This process is called range alignment in Inverse Synthetic Aperture Radar imaging.)

As I mentioned before, my idea is to set the block size to (1,1) and the grid size to (256,1) for the parent kernel, so that 256 threads in different blocks each launch the child kernel, which performs the operation, concurrently.

It’s difficult for me to design a correct and efficient parallel program. Could you give me some advice?

I’m not a native English speaker, so I don’t know if you can understand me clearly.


It’s difficult for me too. Based on what you’ve said so far, I wouldn’t start my design approach with CDP (CUDA Dynamic Parallelism).

  1. Because it’s hard for me to write efficient parallel programs, I like to use well-engineered libraries wherever possible. FFTs are a good example. I don’t want to write my own FFT. And if you launch a child kernel from a parent kernel, the only way I know of to get an FFT done in that child kernel is to write your own. On the other hand, if I just use the CUFFT library to perform the FFT, I can get all 256 FFTs done in a single CUFFT library call.

  2. Getting the index of the maximum value (I think that goes by the name argmax) is also commonly available in libraries, and I would consider using a library like thrust to do that. It is a parallel reduction operation.

  3. The final shift/movement of data might be something that I would write a CUDA kernel for.

Each of operations 1, 2, and 3 above would operate on your entire data set in a single call. (So, a total of 3 calls.) This should allow you to create sufficient exposed parallelism to saturate the machine, which is one of the top 2 performance priorities of any CUDA programmer. For a large data set, the overhead of breaking this into 3 operations/calls should not be significant compared to the processing time.
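
As a rough illustration of the three-call structure, here is a minimal sketch under several assumptions. It treats every row independently (how rows are combined into pairs isn't specified above), and it uses synthetic test data where row r is a pure tone at FFT bin r. A small hand-written kernel stands in for the library argmax to keep the example short. The names `rowArgmax`, `rowShift`, and `runPipeline` are made up for this sketch. Only `cufftPlan1d` and `cufftExecC2C` come from the CUFFT API.

```cuda
// Build (assumed): nvcc range_align_sketch.cu -lcufft
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

constexpr int kThreads = 256;  // block size; the reduction needs a power of two

// Step 2 stand-in: one block per row finds the bin with the largest |X|^2.
__global__ void rowArgmax(const cufftComplex *spec, int cols, int *peak)
{
    __shared__ float val[kThreads];
    __shared__ int idx[kThreads];
    const cufftComplex *row = spec + (size_t)blockIdx.x * cols;
    float best = -1.0f;
    int bestIdx = 0;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float m = row[c].x * row[c].x + row[c].y * row[c].y;
        if (m > best) { best = m; bestIdx = c; }
    }
    val[threadIdx.x] = best;
    idx[threadIdx.x] = bestIdx;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // shared-memory tree reduction
        if (threadIdx.x < s && val[threadIdx.x + s] > val[threadIdx.x]) {
            val[threadIdx.x] = val[threadIdx.x + s];
            idx[threadIdx.x] = idx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) peak[blockIdx.x] = idx[0];
}

// Step 3: circularly shift each row left by its own peak index (out of place).
__global__ void rowShift(const cufftComplex *in, cufftComplex *out,
                         int rows, int cols, const int *peak)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y;
    if (r < rows && c < cols)
        out[(size_t)r * cols + c] = in[(size_t)r * cols + (c + peak[r]) % cols];
}

// Runs the three calls on synthetic data. Returns the number of rows whose
// detected peak differs from the known tone bin, or -1 on an API error.
// (Early error returns skip cleanup to keep the sketch short.)
int runPipeline()
{
    const int rows = 512, cols = 1024;
    const size_t n = (size_t)rows * cols;
    const double twoPi = 6.283185307179586;
    std::vector<cufftComplex> hIn(n);
    for (int r = 0; r < rows; ++r)          // row r holds a complex tone at bin r
        for (int c = 0; c < cols; ++c) {
            double phase = twoPi * ((r * c) % cols) / cols;
            hIn[(size_t)r * cols + c].x = (float)std::cos(phase);
            hIn[(size_t)r * cols + c].y = (float)std::sin(phase);
        }
    cufftComplex *dIn, *dSpec, *dOut;
    int *dPeak;
    cudaMalloc(&dIn, n * sizeof(cufftComplex));
    cudaMalloc(&dSpec, n * sizeof(cufftComplex));
    cudaMalloc(&dOut, n * sizeof(cufftComplex));
    cudaMalloc(&dPeak, rows * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // Call 1: one batched plan covers all 512 row FFTs of length 1024.
    cufftHandle plan;
    if (cufftPlan1d(&plan, cols, CUFFT_C2C, rows) != CUFFT_SUCCESS) return -1;
    if (cufftExecC2C(plan, dIn, dSpec, CUFFT_FORWARD) != CUFFT_SUCCESS) return -1;
    // Call 2: per-row argmax over the spectrum.
    rowArgmax<<<rows, kThreads>>>(dSpec, cols, dPeak);
    // Call 3: shift the original data by each row's peak index.
    dim3 grid((cols + kThreads - 1) / kThreads, rows);
    rowShift<<<grid, kThreads>>>(dIn, dOut, rows, cols, dPeak);

    std::vector<int> hPeak(rows);
    cudaError_t err = cudaMemcpy(hPeak.data(), dPeak, rows * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int r = 0; r < rows; ++r)
        if (hPeak[r] != r) ++mismatches;
    cufftDestroy(plan);
    cudaFree(dIn); cudaFree(dSpec); cudaFree(dOut); cudaFree(dPeak);
    return err == cudaSuccess ? mismatches : -1;
}

int main()
{
    printf("rows with unexpected peak: %d\n", runPipeline());
    return 0;
}
```

All three calls are issued from the host on the default stream, so they run in order, and each one covers the whole 512 × 1024 matrix. A real range-alignment step would replace the synthetic tones and the per-row shift rule with the actual correlation between rows.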

Generally speaking, taking an operation that works on roughly 512K data points (512 × 1024 = 524,288) and breaking it into 256 kernel calls that each operate on 2048 data points is not more efficient in CUDA. It is less efficient.

Robert, thanks for your help! I think I need to reconsider this problem.