Kernel launch overhead from device using dynamic parallelism

In a talk given by an NVIDIA employee, I heard that the overhead of launching a kernel from the device is roughly the same as the overhead of launching it from the host. The speaker went on to say that if the device-side launches are done in a batch of, say, N kernels, the overhead per kernel would be 1/N of what it would have been had that kernel been launched from the host. What is this ‘batch’ mode? I have reason to believe it does not refer to the case where each thread of the parent kernel launches its own child kernel.
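
For concreteness, here is a minimal sketch of the per-thread pattern I mean, i.e. the one I do not think the speaker was describing. The kernel names `child` and `parent_per_thread`, the launch sizes, and the placeholder work are all made up for illustration. It assumes a device that supports dynamic parallelism (compute capability 3.5 or later) and compilation with relocatable device code, e.g. `nvcc -rdc=true file.cu -lcudadevrt` with a suitable `-arch` flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: does a trivial amount of placeholder work.
__global__ void child(int *out, int parent_tid)
{
    // Only one thread of the child grid writes, to keep the example simple.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        out[parent_tid] = parent_tid;
}

// Per-thread pattern: every thread of the parent grid issues its own
// device-side launch, so n parent threads produce n child grids.
__global__ void parent_per_thread(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        child<<<1, 32>>>(out, tid);
}

int main()
{
    const int n = 64;  // number of parent threads, and so of child launches
    int *out = nullptr;
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(out, 0, n * sizeof(int));

    // One host-side launch; all child launches originate on the device.
    parent_per_thread<<<1, n>>>(out, n);

    // The parent grid is not complete until all of its children are,
    // so a host-side synchronize waits for the child grids as well.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(out);
    return 0;
}
```

If this is not what ‘batch’ means, I would like to know which launch pattern gives the 1/N amortization.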