Performance slowdown on Titan V in the presence of Dynamic Parallelism code

I am having the following performance problem with CUDA. When I run a simple sample code on a Titan X and a Titan V card, the running times are as expected:

Titan X: 0.269299 ms
Titan V: 0.111766 ms

Now, when I add another kernel to the code that uses dynamic parallelism, but never call it or use it at all, performance on the Volta GPU drops drastically (roughly 18x slower), while on other cards it is not affected:

Titan X: 0.270602 ms
Titan V: 1.999299 ms

I want to emphasize that this second kernel is not used at all. It just sits next to the rest of the code, i.e., it is only compiled along with it. If I comment out the recursive kernel calls and the stream creation inside that kernel, the running times on Volta become good again. I suspect that the mere presence of dynamic parallelism has a negative effect on the code, even when it is never used at runtime. Any ideas on how to approach this problem?
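
A minimal sketch of the setup described above, not the original code. The kernel names `simpleKernel` and `recursiveKernel`, the `WITH_CDP` macro, the workload, and the build flags are all assumptions made for illustration. The idea is to time only the simple kernel and to compile the dynamic parallelism kernel in or out without ever launching it:

```cuda
// Assumed build line (dynamic parallelism needs relocatable device code):
//   nvcc -arch=sm_70 -rdc=true repro.cu -o repro -lcudadevrt
// Add -DWITH_CDP to compile the unused dynamic parallelism kernel in.
#include <cstdio>
#include <cuda_runtime.h>

// The kernel that is actually launched and timed (hypothetical workload).
__global__ void simpleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

#ifdef WITH_CDP
// Never launched from the host; it is only compiled next to simpleKernel.
__global__ void recursiveKernel(float *data, int n, int depth) {
    if (depth == 0 || n < 2) return;
    cudaStream_t s;
    // Device-side stream creation (streams created on the device must be non-blocking).
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    // Recursive child launch from the device.
    recursiveKernel<<<1, 1, 0, s>>>(data, n / 2, depth - 1);
    cudaStreamDestroy(s);
}
#endif

// Returns the average time of one simpleKernel launch in ms, or -1 on allocation failure.
float timeSimpleKernelMs(int n, int reps) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not part of the measurement.
    simpleKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) simpleKernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms / reps;
}

int main() {
    printf("simpleKernel: %f ms\n", timeSimpleKernelMs(1 << 20, 100));
    return 0;
}
```

Building the same file once with and once without `-DWITH_CDP` and comparing the printed time on each card is one way to check whether the unused kernel alone changes the measurement. The warm-up launch is there to separate a possible one-time startup cost from a slowdown in the kernel itself.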

Since I don’t have a Turing or Volta card, I can’t say from personal experience, but it seems there is something going on:

[url]https://devtalk.nvidia.com/default/topic/1042488/cuda-programming-and-performance/profiling-debugging-tools-don-t-support-cuda-dynamic-parallelism-on-volta-and-turing-/[/url]