CUDA slow performance after process sleep/wait

You’re close to measuring just the launch overhead with these tiny kernels that do almost nothing. I suspect the difference here would be lost in the noise if your kernels executed for a few milliseconds. For example, by adding a for loop to your kernel so that it does its work 100 times, I get results like this:

Average time without std::this_thread::sleep_for is 0.00415706
Average time with std::this_thread::sleep_for is 0.00416839

Now the difference is less than 1%.
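For reference, here is a minimal sketch of the kind of measurement I mean (the kernel body, loop count, and sleep duration are my own choices, not taken from your code):

```cpp
// Sketch: compare the timed cost of a padded kernel launch with and without
// a preceding std::this_thread::sleep_for on the launching host thread.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

__global__ void busy_kernel(float *out, int iters)
{
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i)   // pad the kernel duration
        v = v * 0.999f + 1.0f;
    out[threadIdx.x] = v;
}

static double time_one_launch(float *d_out, int iters, bool sleep_first)
{
    if (sleep_first)
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<1, 256>>>(d_out, iters);
    cudaDeviceSynchronize();          // include the kernel execution in the timing
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    cudaDeviceSynchronize();          // force context creation before timing

    const int reps = 100, iters = 1 << 20;
    double no_sleep = 0.0, with_sleep = 0.0;
    for (int i = 0; i < reps; ++i) no_sleep   += time_one_launch(d_out, iters, false);
    for (int i = 0; i < reps; ++i) with_sleep += time_one_launch(d_out, iters, true);

    printf("Average time without sleep: %g s\n", no_sleep / reps);
    printf("Average time with sleep:    %g s\n", with_sleep / reps);
    cudaFree(d_out);
    return 0;
}
```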

So we are talking about a relatively small fixed cost (on the order of the kernel launch overhead) that appears when you put a thread to sleep and then launch a kernel after the thread wakes up.

Since CUDA uses lazy initialization, it wouldn’t surprise me if there is some additional resource initialization needed to make the CUDA runtime usable again after a thread goes to sleep and wakes up.
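If you want to keep one-time lazy initialization out of your measurements, you can force the context to be created before the timed region. A minimal sketch (this is my own warm-up idiom, not anything from your code, and it does not address whatever per-wakeup cost you observed):

```cpp
// Sketch: trigger CUDA context creation up front so first-call lazy
// initialization does not land inside a timed region.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaFree(0);               // classic idiom: forces runtime/context initialization
    cudaDeviceSynchronize();   // make sure initialization has completed

    // ... timed kernel launches go here ...
    printf("CUDA context initialized before timing starts\n");
    return 0;
}
```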

I have no idea what is actually happening; that is just idle speculation. But generally good advice for CUDA is to avoid launching kernels that run for only a few microseconds or a few tens of microseconds. If you launch kernels that short, then even independent of your observation here, the cost to launch a kernel becomes a significant part of your overall workflow, and you are using the GPU inefficiently.
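To illustrate that general advice (with made-up names and a toy operation, not your actual workload), the usual fix is to fold many tiny launches into one larger launch, for example with a grid-stride loop:

```cpp
// Sketch: replace many microsecond-scale launches with a single launch
// that covers the whole data set, so launch overhead is paid once.
#include <cuda_runtime.h>
#include <algorithm>

__global__ void process_chunk(float *data, int offset, int n)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + n) data[i] *= 2.0f;
}

__global__ void process_all(float *data, int total)
{
    // grid-stride loop: one launch walks the entire data set
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

void run(float *d_data, int total, int chunk)
{
    // Inefficient pattern: many tiny launches, each dominated by launch overhead
    for (int off = 0; off < total; off += chunk)
        process_chunk<<<(chunk + 255) / 256, 256>>>(d_data, off,
                                                    std::min(chunk, total - off));

    // Preferred pattern: one launch that does all the work
    process_all<<<256, 256>>>(d_data, total);
    cudaDeviceSynchronize();
}
```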

You can file a bug if you wish, but my guess is that:

  1. this is ultimately expected behavior
  2. it’s unlikely that significant resources would be applied to improving this situation, because you are using the GPU inefficiently (even if this were “fixed”).