With tiny kernels like these that do almost nothing, you’re essentially just measuring the launch overhead. I suspect the difference you see would be lost in the noise if your kernels executed for a few milliseconds. For example, by adding a for loop to the kernel so its body runs 100 times, I get results like this:
Average time without std::this_thread::sleep_for is 0.00415706
Average time with std::this_thread::sleep_for is 0.00416839
Now the difference is less than 1%.
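For reference, the harness I used looks roughly like this. It is a sketch, not your exact code: the kernel name, the busy-work inside the loop, the iteration count, and the trial count are all my own choices for illustration.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Trivial kernel with an internal loop, so that execution time dominates
// the launch overhead. The iteration count and busy-work are arbitrary.
__global__ void busy_kernel(float *data, int iters) {
    float v = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.000001f;   // arbitrary busy-work
    data[threadIdx.x] = v;
}

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    busy_kernel<<<1, 256>>>(d, 100);     // warm-up launch
    cudaDeviceSynchronize();

    for (int trial = 0; trial < 10; ++trial) {
        // Uncomment to measure the "after sleep" case:
        // std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto t0 = std::chrono::high_resolution_clock::now();
        busy_kernel<<<1, 256>>>(d, 100);
        cudaDeviceSynchronize();
        std::chrono::duration<double> dt =
            std::chrono::high_resolution_clock::now() - t0;
        printf("trial %d: %f s\n", trial, dt.count());
    }
    cudaFree(d);
    return 0;
}
```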
So we are talking about a relatively small fixed cost (on the order of the kernel launch overhead) that appears when you put a thread to sleep and then launch a kernel after the thread wakes up.
Since CUDA has a lazy initialization process, it wouldn’t surprise me if there were some additional resource initialization needed to make the CUDA runtime usable again after a thread goes to sleep and wakes up.
I don’t actually know what is happening; that is just idle speculation. But generally good advice for CUDA is to avoid launching kernels that run for only a few microseconds, or a few tens of microseconds. If you do, even independent of your observation here, the cost to launch a kernel becomes a significant part of your overall workflow, and you are using the GPU inefficiently.
You can file a bug if you wish, but my guess is that:
- this is ultimately expected behavior
- it’s unlikely that significant resources would be applied to improving this situation, because a workload of such short kernels is using the GPU inefficiently even if this were “fixed”.