A question about low performance with dynamic parallelism on very large data

Hi guys,

I’m converting some existing GPU code to use the new dynamic parallelism feature. The code runs fine on a smaller data set, and with dynamic parallelism the running time is about half of what it was with the regular GPU code. However, when I give it a really large data set, everything changes. Not only do the child kernels launched from the parent kernel run much slower, but the __global__ functions that are launched from the host, which are exactly the same as in the old GPU code, run slower too. It doesn’t make any sense. Has anybody run into this before?
I’m using “nvcc -rdc=true xxxx.cu -o xxxx -lcudadevrt” to compile the code.
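A minimal sketch of the kind of parent/child launch this command builds. The kernel names, data size, and chunk size are hypothetical and not taken from the actual code; it only illustrates the pattern.

```cuda
// Build as in the post: nvcc -rdc=true xxxx.cu -o xxxx -lcudadevrt
// Dynamic parallelism needs compute capability 3.5 or higher, so a matching
// -arch flag for your GPU may also be required depending on the toolkit default.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: doubles every element of one chunk.
__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical parent kernel: each parent thread launches one child grid
// for its own chunk of the array.
__global__ void parent(float *data, int n, int chunk)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int start = c * chunk;
    if (start >= n) return;
    int len = min(chunk, n - start);
    child<<<(len + 255) / 256, 256>>>(data + start, len);
}

int main()
{
    const int n = 1 << 20;      // assumed data size
    const int chunk = 1 << 14;  // assumed work per child grid
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int chunks = (n + chunk - 1) / chunk;
    parent<<<(chunks + 63) / 64, 64>>>(d, n, chunk);

    // The parent grid is not complete until all of its child grids finish,
    // so this host-side sync also waits for the children.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return 0;
}
```

With larger data the number of child launches grows with the number of chunks, so checking the status returned by `cudaDeviceSynchronize()` is a cheap first step before timing anything.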
Please give me some advice!


Is your code using streams?
Dynamic parallelism apparently disables parallel execution of multiple kernels launched from the host side to keep the GPU free for the kernels that will be launched from the device side.
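To make the question concrete, this is roughly what "using streams" means on the host side: kernels can only overlap if they are launched into different non-default streams. A minimal sketch with a hypothetical kernel and sizes; whether the two launches actually run concurrently depends on the hardware and on free resources.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;  // assumed size
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launches in different streams are allowed to overlap; launches in the
    // same stream (or the default stream) run one after another.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(b, n, 3.0f);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If every host launch in your code uses the plain `<<<grid, block>>>` form, everything goes into the default stream and is serialized anyway, so this particular restriction would not explain the slowdown.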

Hi, thanks for your suggestion. However, there are no streams in my code. Does that really make a big difference? What else do you think might cause this kind of slowdown?