CUDA Dynamic Parallelism Launch Overhead

tnallen · March 16, 2017, 8:46pm

Several years ago there was a post confirming the average CUDA kernel launch latency.
https://devtalk.nvidia.com/default/topic/549674/cuda-programming-and-performance/overhead-between-two-successive-kernel-calls/post/3845966/#3845966

I was wondering if there is any updated information on this value, and possibly similar information for launches using dynamic parallelism?

Thanks

grynet · March 16, 2017, 9:51pm

Follow-up question: is launching a single kernel from the device slower than launching a single kernel from the host?

njuffa · March 16, 2017, 10:13pm

I have not measured it but my understanding is that a kernel launch from the device uses the same hardware path as a kernel launch from the host; therefore the launch overhead is nearly identical (eliminating the PCIe latency should make a measurable, but small difference). The advantage of dynamic parallelism is increased flexibility.

grynet · March 16, 2017, 10:32pm

I think that when you launch a single kernel from the device, the SM basically starts the operation. As I understand it, that uses a different hardware path. If there is solid information about this, it would be very useful.

Robert_Crovella · March 16, 2017, 10:49pm

According to my testing, I agree with njuffa. Casual testing in either case suggests a minimum launch latency (for a typical launch sequence) of a few microseconds.
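
For anyone who wants to try this kind of casual test, a minimal sketch might look like the following. It is not the code behind the numbers above, just one possible approach: the kernel names (`emptyKernel`, `parentKernel`), the `CHECK` macro, and the launch count are made up for illustration. It times back-to-back launches of an empty kernel from the host, then the same number of launches from a single device thread, so it reports average per-launch cost in a launch sequence rather than the latency of one isolated launch. Dynamic parallelism needs compute capability 3.5 or higher and relocatable device code.

```cuda
// Assumed build line (adjust -arch for your GPU):
//   nvcc -arch=sm_XX -rdc=true launch_overhead.cu -o launch_overhead -lcudadevrt
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Does no work, so the measured time is dominated by launch cost.
__global__ void emptyKernel() {}

// A single device thread launches n empty child kernels.
__global__ void parentKernel(int n) {
    for (int i = 0; i < n; ++i) {
        emptyKernel<<<1, 1>>>();
    }
}

int main() {
    // Kept below the default device-side pending launch limit (2048).
    const int n = 1000;

    // Warm-up so one-time initialization is not included in the timings.
    emptyKernel<<<1, 1>>>();
    parentKernel<<<1, 1>>>(1);
    CHECK(cudaDeviceSynchronize());

    // Host-side launches: wall-clock time for n launches plus one final sync.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    double hostUs =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / n;

    // Device-side launches: the parent grid is not complete until all of its
    // children are, so the stop event covers every child launch.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    parentKernel<<<1, 1>>>(n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    double deviceUs = ms * 1000.0 / n;

    printf("host launches:   %.2f us per launch (average)\n", hostUs);
    printf("device launches: %.2f us per launch (average)\n", deviceUs);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return 0;
}
```

The results will vary with GPU, driver, and CUDA version, and the two numbers are measured with different clocks (host wall clock versus CUDA events), so treat them as a rough comparison only.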

tnallen · March 17, 2017, 3:35am

So you would assume it would be the normal CUDA launch latency minus 1-2 microseconds? Does anyone have sources that say a device launch uses the same hardware path as a host launch?