How does CUDA dynamic parallelism reduce CPU-GPU communication?

I find it difficult to understand how CUDA dynamic parallelism helps reduce CPU-GPU communication. As I understand it, all the GPU threads can be launched from the CPU in a single call, as in the following matrix-addition example:

add_matrices<<<grid, block>>>(ad, bd, cd, N);

All the GPU threads needed to add the two matrices are launched by the above line of code. Can dynamic parallelism be used to speed up this example? Or is a performance gain from dynamic parallelism only possible in situations like quicksort, where the number of required threads is not known in advance?
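For context, here is a minimal sketch of what a dynamic-parallelism version might look like (the kernel names `add_matrices_dp` and `add_row` are illustrative, not from any real API; this pattern requires compute capability 3.5+ and compilation with `nvcc -rdc=true`):

```cuda
// Child kernel: adds one row of the two matrices (illustrative sketch).
__global__ void add_row(const float *a, const float *b, float *c, int row, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < N)
        c[row * N + col] = a[row * N + col] + b[row * N + col];
}

// Parent kernel: each thread launches a child grid for its own row,
// directly from the GPU, with no round trip to the CPU.
__global__ void add_matrices_dp(const float *a, const float *b, float *c, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N)
        add_row<<<(N + 255) / 256, 256>>>(a, b, c, row, N);
}
```

For a regular workload like this, the child launches add overhead rather than saving communication, since a single flat launch from the host already covers every element.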

Can someone provide a clear explanation of how performance changes when adopting dynamic parallelism?
