I’m converting some existing GPU code to use the new dynamic parallelism feature. With a smaller data set the code runs successfully, and the running time with dynamic parallelism is about half of what it was with the regular GPU code. However, when I feed it a really big data set, everything changes. Not only do the child kernels launched by the parent kernel run much slower, but the global functions that are launched from the host — and are exactly the same as in the old GPU code — run slower too. It doesn’t make any sense to me. Has anybody run into this kind of behavior before?
I’m using “nvcc -rdc=true xxxx.cu -o xxxx -lcudadevrt” to compile the code.
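For context, here is a minimal sketch of the parent/child launch structure I mean (the kernel names, sizes, and the doubling operation are just placeholders, not my actual code):

```cuda
#include <cstdio>

// Child kernel: launched from the device (dynamic parallelism).
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

// Parent kernel: launched from the host; each thread launches a
// child grid over its own slice of the data.
__global__ void parentKernel(float *data, int n, int slice) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = t * slice;
    if (begin < n) {
        int count = min(slice, n - begin);
        // Device-side launch; requires -rdc=true and -lcudadevrt.
        childKernel<<<(count + 255) / 256, 256>>>(data + begin, count);
    }
}
```

With large inputs, the number of pending device-side launches grows with the data size, which is one place I suspect the slowdown could come from.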
Please give me some advice!