Does saturating a stream hide kernel launch latency?

must say, for an empiricist, you ‘ponder hypotheticals’ with rather some splendour

if proper design can negate host-side kernel launch overhead, would the same hold for device-side kernel launch overhead? that is, when using dynamic parallelism, can kernel launch overhead still be negated with proper design?

on another but related note, if a kernel can theoretically contain 512 million instructions, where are those stored? and i suppose they get cached too? in the case of lengthy code, is there some sort of instruction pre-fetching, or not even that?

I have not had enough exposure to dynamic parallelism to say what latency-covering strategies exist or may be necessary there. Dynamic parallelism only provides an indirect performance boost, for example by allowing for dynamically adaptive grids in simulations (vs a finer fixed grid) or by eliminating a GPU->CPU->GPU roundtrip for dynamic control flow (e.g. device-side BLAS used as part of a solver). Computational tasks that do not benefit from these functional improvements are unlikely to see a speedup vs classical launch-from-host scenarios.
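To make the roundtrip elimination a bit more concrete, here is a minimal sketch (not production code; the kernel names and the "refinement" step are made up for illustration) of a parent kernel that inspects a device-resident residual and launches the follow-up step itself, so the decision never has to travel back to the host:

```
// Minimal dynamic-parallelism sketch: the parent kernel decides on the device
// whether further work is needed and launches the follow-up kernel itself.
// Build with relocatable device code, e.g.:
//   nvcc -rdc=true -arch=sm_70 dp_sketch.cu -o dp_sketch
// (older toolchains may additionally need -lcudadevrt)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void refineStep(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;                 // stand-in for a real refinement step
    }
}

__global__ void solverStep(float *data, int n, const float *residual, float tol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;                 // stand-in for the main solver work
    }
    // A single thread inspects the (previously computed) residual and launches
    // the follow-up kernel from the device: no GPU->CPU->GPU roundtrip needed.
    if (blockIdx.x == 0 && threadIdx.x == 0 && *residual > tol) {
        refineStep<<<(n + 255) / 256, 256>>>(data, n);
    }
}

int main()
{
    const int n = 1 << 20;
    float *data, *residual;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&residual, sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    float h_res = 1.0f;                  // pretend the residual exceeds tolerance
    cudaMemcpy(residual, &h_res, sizeof(float), cudaMemcpyHostToDevice);

    solverStep<<<(n + 255) / 256, 256>>>(data, n, residual, 0.01f);
    cudaDeviceSynchronize();             // parent only completes after its child
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(data);
    cudaFree(residual);
    return 0;
}
```

Without dynamic parallelism, the residual would have to be copied back to the host, inspected there, and the follow-up kernel launched from the host, which is exactly the roundtrip in question.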

CUDA kernel code is stored in global memory. There is an instruction cache, which is fairly small. I do not recall its exact size; it may be 4 KB or 8 KB, so please check the documentation. Some micro-architectural benchmarking should also reveal the size, because loops whose body exceeds the instruction cache size will experience some slowdown. In my recollection it is a relatively minor effect but easily reproducible and measurable. As far as I know there is instruction prefetching but no branch prediction.
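If you want to probe the instruction cache yourself, a rough micro-benchmark along those lines could look like the sketch below. The unroll factors are arbitrary guesses, not values tuned to any particular GPU, the measured effect will vary by architecture, and the compiler may decline to fully unroll very large bodies:

```
// Rough sketch of an instruction-cache probe: two kernels perform the same
// number of dependent FMAs, but with different amounts of unrolling and hence
// different static code sizes. If the larger body spills out of the
// instruction cache, its runtime per operation should increase.

#include <cstdio>
#include <cuda_runtime.h>

template <int UNROLL>
__global__ void icacheProbe(float *out, int iters)
{
    float a = threadIdx.x * 1e-7f;
    float b = 1.000001f;
    for (int i = 0; i < iters; i++) {
        #pragma unroll                       // code size grows with UNROLL
        for (int j = 0; j < UNROLL; j++) {
            a = fmaf(a, b, 1e-9f);           // dependent chain, not optimized away
        }
    }
    out[threadIdx.x] = a;                    // keep the result observable
}

template <int UNROLL>
static float timeKernelMs(float *out, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    icacheProbe<UNROLL><<<1, 128>>>(out, iters);   // warm-up launch
    cudaEventRecord(start);
    icacheProbe<UNROLL><<<1, 128>>>(out, iters);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    float *out;
    cudaMalloc(&out, 128 * sizeof(float));
    // Same total FMA count per thread, but different static code sizes.
    float small = timeKernelMs<256>(out, 4096);    // a few KB of code
    float large = timeKernelMs<8192>(out, 128);    // tens of KB of code
    printf("small body: %.3f ms, large body: %.3f ms\n", small, large);
    cudaFree(out);
    return 0;
}
```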

you mentioned “indirect performance boost” and “speedup”; of course, not to take what you said out of context, but my principal goal is not to seek performance gains/speed-ups by blindly slapping dynamic parallelism onto my current kernel

i am very much aware of, and interested in, both benefits of dynamic parallelism you mention: “…allowing for dynamically adaptive grids…” and “…eliminating a GPU->CPU->GPU roundtrip…”, provided of course the cost is not prohibitive/significant

can i use dynamic parallelism to my advantage, to further improve my kernel/algorithm? perhaps
what would be the overall cost/benefit, within the context of my kernel? good question

the kepler architecture whitepaper really starts to stir dynamic parallelism, hyper-q (on the hardware side), and the grid management unit into the same pot, to such an extent that i can’t help but begin to think that host-side and device-side kernel launches appear much the same

i have now exhausted ‘hypotheticals to ponder’. i suppose one could conduct an empirical study to prove or disprove the hypothesis that there is no significant difference/throughput change when launching a set of kernels from the device compared to launching the same set from the host, and use that to make inferences about device-side kernel launches in general

changing my kernel to rely more on dynamic parallelism, just to see whether it yields something positive, would be no small task; that is why i am trying to comprehend the true cost as much as possible beforehand

The underlying hardware mechanism that launches a kernel is of course the same for kernels launched from the host and from the device. So when initiating the launch, time spent in that hardware machinery will be the same. But other components of the launch overhead will be specific to the type of launch, such as the overhead of PCIe transmission, which applies to launches from the host, or state-saving cost, which applies to launches from the device. As far as I am aware, NVIDIA does not provide a detailed breakdown of launch overhead components. Given the many variables that feed into that, I suspect it would be difficult to do so in full generality.
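For the narrower question of raw launch throughput, a minimal comparison of back-to-back empty-kernel launches from the host and from the device could look like the sketch below. The launch count and margins are arbitrary, and empty kernels deliberately exclude all the other variables mentioned above, so treat any numbers it produces as a starting point rather than a verdict:

```
// Sketch: compare launching N empty kernels from the host with launching the
// same N empty kernels from a single-threaded parent kernel on the device.
// Build with relocatable device code, e.g.:  nvcc -rdc=true launch_bench.cu

#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

__global__ void parentLauncher(int n)
{
    // One parent thread issues n device-side launches back to back. The exact
    // ordering semantics of the device-side default stream differ between CDP
    // versions, which does not matter here since the kernels are empty.
    for (int i = 0; i < n; i++) {
        emptyKernel<<<1, 1>>>();
    }
}

static void hostLaunches(int n)
{
    for (int i = 0; i < n; i++) {
        emptyKernel<<<1, 1>>>();       // back-to-back launches from the host
    }
}

static void deviceLaunches(int n)
{
    parentLauncher<<<1, 1>>>(n);       // children are launched from the device
}

static float timeMs(void (*body)(int), int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    body(n);                           // covers launch overhead and execution
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 10000;
    // Enlarge the device-runtime pending launch pool (default is about 2048)
    // so the device-side launches are not throttled or rejected.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, n + 128);

    hostLaunches(100);                 // warm-up: context creation, module load
    deviceLaunches(100);
    cudaDeviceSynchronize();

    printf("host-side:   %.3f ms for %d launches\n", timeMs(hostLaunches, n), n);
    printf("device-side: %.3f ms for %d launches\n", timeMs(deviceLaunches, n), n);
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```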

If you are considering using dynamic parallelism for your use case, one additional approach (which you may well have tried already) is to check whether there are publications describing how dynamic parallelism has been applied to your problem or a similar one, and what the outcome of that was. If the reported outcome was positive, that may then be justification for tackling the non-trivial cost of changing your own software. In terms of an experimental approach, you could also look into doing a feasibility study using a rough prototype or mock-up of your full application with significantly reduced complexity.