Dear CUDA masters and NVIDIA people,
Another very important question is at hand. I have noticed a significant drop in my application's speed when I built the source code with relocatable device code (about 20%, if not more). Even tiny, trivial kernels with no device function calls showed increased kernel times.
The thing is, I have a problem for which dynamic parallelism is the optimal solution, but it can't be used without rdc. I very much like the idea of launching kernels from the GPU based on some condition, but it simply cannot be done without this feature. (The workaround is copying the "flags" to the CPU, checking them there, and then launching the kernels, but we want to avoid CPU-GPU communication by any means necessary, since it is a bottleneck.)
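To illustrate what I mean, here is a minimal sketch of the device-side conditional launch (kernel names, the flag, and the block size are placeholders, not my actual code):

```cuda
// Hypothetical child kernel; the actual work is elided.
__global__ void childKernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... conditional work on data[i] ...
    }
}

// Parent kernel: inspects a device-resident flag and launches the
// child directly from the GPU, avoiding a host round trip.
__global__ void parentKernel(const int *flag, const float *data, int n)
{
    // Launch from a single thread so only one child grid is created.
    if (threadIdx.x == 0 && blockIdx.x == 0 && *flag) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```

This pattern only compiles with `nvcc -rdc=true` (linking against `cudadevrt`), which is exactly the constraint I'm asking about.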
Is there any hope that linking will be optimized in the CUDA compilation trajectory any time soon so that rdc won’t create any additional overhead?
And if not, should we then expect that we will never be able to launch kernels from the GPU without this penalty, and that CPU-GPU communication is unavoidable?