CUDA Dynamic Parallelism Launch Overhead

Several years ago there was a post reporting the average CUDA kernel launch latency:
https://devtalk.nvidia.com/default/topic/549674/cuda-programming-and-performance/overhead-between-two-successive-kernel-calls/post/3845966/#3845966

I was wondering if there is any updated information on this value, and possibly similar information for launches using dynamic parallelism?

Thanks

Follow-up question: is launching a single kernel from the device slower than launching a single kernel from the host?

I have not measured it, but my understanding is that a kernel launch from the device uses the same hardware path as a kernel launch from the host, so the launch overhead should be nearly identical (eliminating the PCIe latency should make a measurable but small difference). The advantage of dynamic parallelism is increased flexibility.
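For reference, a minimal sketch of what a device-side (dynamic parallelism) launch looks like in code; kernel names and launch configurations here are illustrative:

```cuda
// Sketch of a device-side launch via dynamic parallelism.
// Requires relocatable device code, e.g.:
//   nvcc -arch=sm_35 -rdc=true cdp_sketch.cu -o cdp_sketch
#include <cstdio>

__global__ void childKernel()
{
    printf("child kernel running\n");
}

__global__ void parentKernel()
{
    // Launched from device code with the same <<<...>>> syntax
    // used for a host-side launch.
    childKernel<<<1, 1>>>();
    // Note: device-side cudaDeviceSynchronize() was deprecated in
    // newer CUDA releases; in pre-CUDA-12 CDP it waited on children.
}

int main()
{
    parentKernel<<<1, 1>>>();  // ordinary host-side launch of the parent
    cudaDeviceSynchronize();
    return 0;
}
```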

I think that when you launch a single kernel from the device, the SM itself initiates the operation. To my understanding, it uses a different hardware path. If there is solid information about this, it would be very useful.

Based on my testing, I agree with njuffa. Casual testing in either case suggests a minimum launch latency (for a typical launch sequence) of a few microseconds.
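One way such casual measurements are often done is to time many back-to-back launches of an empty kernel and divide by the iteration count. A rough sketch (iteration count and kernel name are arbitrary; results vary by GPU, driver, and OS):

```cuda
// Rough host-side launch-latency measurement (illustrative only).
// Amortizes the cost of the final synchronize over many launches.
#include <cstdio>
#include <chrono>

__global__ void emptyKernel() {}

int main()
{
    const int iterations = 10000;

    emptyKernel<<<1, 1>>>();   // warm-up launch to exclude one-time init cost
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto stop = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    printf("average launch+execute latency: %.2f us\n", us / iterations);
    return 0;
}
```

Because the launches are queued back-to-back, this measures throughput-style latency; a single isolated launch may show somewhat higher overhead.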

So you would assume it is the normal CUDA launch process minus 1-2 microseconds? Does anyone have a source confirming that it uses the same hardware path as a host launch?