I have not measured it, but my understanding is that a kernel launch from the device uses the same hardware path as a kernel launch from the host. The launch overhead should therefore be nearly identical (eliminating the PCIe latency should make a measurable but small difference). The advantage of dynamic parallelism is increased flexibility, not lower launch cost.
I think that when you launch a single kernel from the device, the SM itself basically starts the operation. As I understand it, that would be a different hardware path. If anyone has solid information about this, it would be very useful.
Based on my testing, I agree with njuffa. Casual testing of both cases (host launch and device launch) suggests a minimum launch latency, for a typical launch sequence, of a few microseconds.
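For anyone who wants to try this kind of casual comparison, here is a minimal sketch. It is not the test code used for the numbers above. The kernel names `emptyKernel` and `parentKernel`, the iteration count, and the build line are all illustrative assumptions. It times a batch of back-to-back empty launches from the host, then the same number of child launches issued by a single device thread, so it reports average cost per launch in a batch rather than the latency of one isolated launch.

```cuda
// Assumed build line (adjust -arch to your GPU; dynamic parallelism needs
// compute capability 3.5 or higher):
//   nvcc -rdc=true -arch=sm_70 launch_overhead.cu -o launch_overhead -lcudadevrt
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Does no work, so the measured time is dominated by launch overhead.
__global__ void emptyKernel() {}

// A single thread launches n child grids one after another.
// The parent grid is not complete until all of its children are complete.
__global__ void parentKernel(int n) {
    for (int i = 0; i < n; ++i) {
        emptyKernel<<<1, 1>>>();
    }
}

int main() {
    // Kept below the default pending-launch limit for device-side launches.
    const int iters = 1000;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up so one-time initialization is not included in the timings.
    emptyKernel<<<1, 1>>>();
    parentKernel<<<1, 1>>>(1);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Case 1: launches issued from the host.
    float hostMs = 0.0f;
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventElapsedTime(&hostMs, start, stop));

    // Case 2: launches issued from the device (dynamic parallelism).
    float devMs = 0.0f;
    CHECK(cudaEventRecord(start));
    parentKernel<<<1, 1>>>(iters);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventElapsedTime(&devMs, start, stop));

    // cudaEventElapsedTime reports milliseconds; convert to microseconds.
    printf("host launches:   %.2f us per launch (average over %d)\n",
           hostMs * 1000.0f / iters, iters);
    printf("device launches: %.2f us per launch (average over %d)\n",
           devMs * 1000.0f / iters, iters);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return 0;
}
```

The results will vary with GPU, driver, and OS, and a measurement like this cannot by itself show whether the two cases share a hardware path. It only shows whether the per-launch cost is in the same range.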
So you would assume it is the normal CUDA launch process minus 1-2 microseconds? Does anyone have sources that say a device launch uses the same hardware path as a host launch?