Expected relocatable device code optimization date

Dear CUDA masters and NVIDIA people,

Another very important question is at hand. I have noticed a drop of about 20%, if not more, in my application speed when I build the source code with relocatable device code (rdc). Even some tiny, trivial kernels with no device function calls had increased kernel times.

The thing is, I have a problem where dynamic parallelism is the best solution, but it can’t be introduced without rdc. I very much like the idea of launching kernels from the GPU based on some condition, but it simply cannot be done without rdc. (The workaround is copying the “flags” to the CPU, checking them there and then launching the kernels, but we want to avoid CPU-GPU communication by any means necessary, since it is a bottleneck.)
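For readers following along, here is a minimal sketch of the pattern described above, not the poster's actual code. A parent kernel reads a flag that lives in device memory and launches a child kernel only when the flag is set, so no flag has to travel back to the CPU. All names (`child_kernel`, `parent_kernel`, `run_conditional_launch`) are hypothetical, and the build line in the comment is an assumption about a typical setup.

```cuda
// Assumed build line (the architecture is a placeholder; dynamic parallelism
// needs compute capability 3.5 or higher), together with a main() that calls
// run_conditional_launch():
//   nvcc -O3 -rdc=true -arch=sm_70 conditional_launch.cu main.cu -lcudadevrt
#include <cuda_runtime.h>
#include <vector>

// Child kernel: the work that should only run when the flag is set.
__global__ void child_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Parent kernel: one thread checks the device-side flag and, if it is
// non-zero, launches the child grid directly from the GPU.
__global__ void parent_kernel(const int *flag, int *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0 && *flag != 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        child_kernel<<<blocks, threads>>>(data, n);
    }
}

// Host helper (hypothetical). Assumes n > 0. Returns how many elements the
// child kernel incremented, or -1 on a CUDA error. In a real application the
// flag would be written by an earlier kernel, not copied from the host.
int run_conditional_launch(int flag_value, int n)
{
    int *d_flag = nullptr;
    int *d_data = nullptr;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_flag);
        return -1;
    }
    cudaMemcpy(d_flag, &flag_value, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_data, 0, n * sizeof(int));

    parent_kernel<<<1, 1>>>(d_flag, d_data, n);
    // The parent grid is not complete until its child grid has finished,
    // so one host-side synchronize covers both.
    cudaError_t err = cudaDeviceSynchronize();

    std::vector<int> host(n);
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), d_data, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    cudaFree(d_flag);
    if (err != cudaSuccess) return -1;

    int touched = 0;
    for (int v : host) touched += v;
    return touched;
}
```

Calling `run_conditional_launch(1, 1000)` exercises the path where the child grid runs, and `run_conditional_launch(0, 1000)` the path where it is skipped. The CPU-side workaround would instead copy the flag back with `cudaMemcpy` and branch on the host before launching `child_kernel`.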

Is there any hope that linking will be optimized in the CUDA compilation trajectory any time soon so that rdc won’t create any additional overhead?

And if not, can we then expect that we will never be able to launch kernels from the GPU without this overhead, and that CPU-GPU communication is unavoidable?

KR,

Garko

You’re generally unlikely to get authoritative responses to questions asking what the future holds. I’m not saying it never happens, but there are various risks in making such statements on a public forum, so in my experience authoritative answers to such questions are rare.

NVIDIA constantly works to improve the performance of CUDA, in a range of areas. One of the things that is helpful in this respect is to have well-written test cases describing patterns of interest to our customers.

If you have such a concrete, complete test case that you’d like to submit, you can do so using the bug reporting system described at the top of this sub-forum in a sticky post.

If you want to provide a test case here, others may be able to suggest ways to improve performance currently.

A couple of suggestions:

  1. Make sure you are not compiling a debug project.
  2. You can partition a project into modules that require rdc and those that don’t. This may help with the performance of those kernels in your project that don’t depend on rdc.
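As a rough illustration of the second suggestion, the sketch below shows what a module built without rdc might look like. The file names, the kernel, the `scale_on_device` wrapper and the build commands in the comments are all assumptions for illustration, not something taken from the original project.

```cuda
// plain_kernels.cu (hypothetical file name): kernels that launch no other
// kernels and call no device functions from other translation units.
//
// Assumed build, whole-program mode for this file only:
//   nvcc -O3 -c plain_kernels.cu -o plain_kernels.o
// Assumed build for the (hypothetical) file that uses dynamic parallelism:
//   nvcc -O3 -rdc=true -c dyn_kernels.cu -o dyn_kernels.o
// Assumed link step, letting nvcc device-link the rdc objects:
//   nvcc plain_kernels.o dyn_kernels.o main.o -o app -lcudadevrt
#include <cuda_runtime.h>
#include <vector>

// A small kernel that does not need relocatable device code.
__global__ void scale_kernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Host-side entry point: other modules call this ordinary C++ function and
// never reference the kernel symbol itself. Returns false on a CUDA error.
bool scale_on_device(std::vector<float> &v, float a)
{
    if (v.empty()) return true;  // nothing to do, avoid a zero-block launch
    int n = static_cast<int>(v.size());
    size_t bytes = v.size() * sizeof(float);

    float *d_x = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    cudaError_t err = cudaMemcpy(d_x, v.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_x, a, n);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // A blocking copy also waits for the kernel on the default stream.
        err = cudaMemcpy(v.data(), d_x, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);
    return err == cudaSuccess;
}
```

The idea is that only the files that really need device-side launches are compiled with `-rdc=true`, while the rest keep whole-program compilation and are reached through host functions such as `scale_on_device`.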

Thank you for a fast reply, I understand. Unfortunately sharing the source is out of the question.

  1. I made sure that I build with the most aggressive optimizations (-O3).

  2. That may very well work, if it is really possible to call kernels built without rdc from a kernel that was built with rdc (I have no idea how the ABI would work under the hood, to be honest). I will definitely try that out!

Thank you.