How does CUDA dynamic parallelism reduce CPU-GPU communication?

Pahan · August 29, 2017, 5:46pm

I find it difficult to understand how CUDA dynamic parallelism helps reduce CPU-GPU communication. As I understand it, all the GPU threads for a task can be launched from the CPU in a single call, as in the following matrix addition example:

add_matrices<<<grid, block>>>(ad, bd, cd, N);

The line above launches all the GPU threads needed to add the two matrices. Can dynamic parallelism speed up this example? Or does the performance gain come in situations like quicksort, where the number of threads required is not known in advance?
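For reference, here is a minimal sketch of how that single launch might look as a complete program. The kernel body, the 16x16 block shape, the use of managed memory, and the helper name `run_add` are all assumptions made for illustration; the post shows only the launch line. The point is that the host issues exactly one launch and the grid size is known up front.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed kernel body: one thread per element of an N x N row-major matrix.
__global__ void add_matrices(const float *a, const float *b, float *c, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        int idx = row * N + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: fills a with 1 and b with 2, adds them on the GPU,
// and reports whether every element of the result equals 3.
bool run_add(int N)
{
    const size_t bytes = size_t(N) * N * sizeof(float);
    float *ad, *bd, *cd;
    // Managed memory keeps the sketch short (an assumption, not from the post).
    cudaMallocManaged(&ad, bytes);
    cudaMallocManaged(&bd, bytes);
    cudaMallocManaged(&cd, bytes);
    for (int i = 0; i < N * N; ++i) { ad[i] = 1.0f; bd[i] = 2.0f; }

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    add_matrices<<<grid, block>>>(ad, bd, cd, N);   // the single host-side launch
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < N * N; ++i) ok = (cd[i] == 3.0f);
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    return ok;
}

int main()
{
    printf("matrix add %s\n", run_add(512) ? "ok" : "failed");
    return 0;
}
```

Because the whole grid is specified by the host before the kernel starts, there is no launch decision left for the GPU to make in this case.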

Can someone clearly explain how performance changes when dynamic parallelism is adopted?
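To make the second case concrete, here is a rough sketch of a data-dependent launch using dynamic parallelism. It is not taken from any reply in this thread; the kernel names `scale_segment` and `process_segments`, the `offsets` array, and the helper `run_segments` are hypothetical. Each parent thread reads the size of its segment from device memory and launches a child grid sized to match, so the host never has to copy the sizes back and issue the launches itself. Building this requires a device of compute capability 3.5 or higher, relocatable device code (`-rdc=true`), and linking against `cudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: doubles (or otherwise scales) one segment of the array.
__global__ void scale_segment(float *data, int length, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) data[i] *= factor;
}

// Parent kernel: one thread per segment. The segment length lives in device
// memory, so the child grid size is decided on the GPU, not on the host.
__global__ void process_segments(float *data, const int *offsets, int num_segments)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_segments) return;
    int begin  = offsets[s];
    int length = offsets[s + 1] - begin;
    if (length > 0) {
        int threads = 128;
        int blocks  = (length + threads - 1) / threads;
        scale_segment<<<blocks, threads>>>(data + begin, length, 2.0f);
    }
}

// Hypothetical helper: three segments of lengths 2, 5, and 3 over ten
// elements that all start at 1; returns true if every element ends up as 2.
bool run_segments()
{
    const int num_segments = 3, total = 10;
    int *offsets; float *data;
    cudaMallocManaged(&offsets, (num_segments + 1) * sizeof(int));
    cudaMallocManaged(&data, total * sizeof(float));
    offsets[0] = 0; offsets[1] = 2; offsets[2] = 7; offsets[3] = total;
    for (int i = 0; i < total; ++i) data[i] = 1.0f;

    process_segments<<<1, 32>>>(data, offsets, num_segments);
    // The host-side sync returns only after the parent and all child grids finish.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < total; ++i) ok = (data[i] == 2.0f);
    cudaFree(offsets); cudaFree(data);
    return ok;
}

int main()
{
    printf("segments %s\n", run_segments() ? "ok" : "failed");
    return 0;
}
```

Without dynamic parallelism, the host would typically need to read the segment sizes back (or size one launch for the worst case) before launching the per-segment work. Whether the device-side launches end up faster depends on the workload, since child launches carry their own overhead.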

Robert_Crovella · August 29, 2017, 5:56pm

cross posting:

https://stackoverflow.com/questions/45945111/how-does-cuda-dynamic-parallelism-reduce-cpu-gpu-communication

blog article:

https://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/