Running expensive kernels without impeding other kernels


I am trying to implement a kernel that uses only a small part of the GPU. Specifically, the kernel is launched in a separate stream with a single block containing a single warp. I hoped the kernel could simply “run in the background”, since it occupies only one warp slot on one SM and the rest of the GPU stays free. The reason for this design is that the kernel takes a very long time to compute (its own run time doesn’t matter), and it would otherwise impede other kernels that have high priority and are somewhat time-critical.
However, contrary to my expectation, running the kernel heavily degrades the performance of the whole system, as if it were taking up all resources (even the window manager struggles).

Why is it behaving like that and is there a better approach to solve this issue?

Example code is appended.

Thanks in advance!

My guess is that you are running this on Windows with the WDDM driver model. Switch the GPU to TCC mode on Windows, or run on a Linux GPU that is not also driving a graphical display. Whenever a CUDA kernel is running on the GPU, regardless of its size, complexity, or resource utilization, the GPU is kept from its other tasks, such as servicing Windows (or Linux) graphics work.

In the old days, kernels in such a situation had a limited duration, because when the kernel was running, the display manager was not, and this is inherently bad for a GUI if it persists. Modern GPUs use preemption to switch between updating the display and letting your CUDA kernel run, but this is still not perfectly efficient. The preemption and context-switching process that the GPU must follow to go from servicing CUDA to servicing the display is going to be a noticeable tax on the system. Again, this behavior, and the disruption from it, is largely independent of the size of your kernel or what exactly your kernel is doing.

So, if this tax on the system is a problem, then the best advice I know of is to move the workload from the GPU that is servicing the display, to one that isn’t.
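
To find out which GPU to move the work to, something like the following sketch may help. It lists every CUDA device and reports two runtime attributes: whether a kernel run-time limit is enabled (`cudaDevAttrKernelExecTimeout`) and whether the device uses the TCC driver (`cudaDevAttrTccDriver`). Treating "no run-time limit" as "not driving a display" is only a heuristic assumed here, not a guarantee, and the selection policy (first such device) is purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }

    int chosen = -1;  // first device that looks free of display duty
    for (int dev = 0; dev < count; ++dev) {
        int hasTimeout = 0;  // 1 if a kernel run-time limit (watchdog) is active
        int isTcc = 0;       // 1 if the device uses the TCC driver (Windows)
        cudaDeviceGetAttribute(&hasTimeout, cudaDevAttrKernelExecTimeout, dev);
        cudaDeviceGetAttribute(&isTcc, cudaDevAttrTccDriver, dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("device %d (%s): run-time limit=%d, TCC=%d\n",
                    dev, prop.name, hasTimeout, isTcc);

        // Heuristic (assumption): a device without a run-time limit is
        // usually not servicing a display.
        if (chosen < 0 && hasTimeout == 0) {
            chosen = dev;
        }
    }

    if (chosen >= 0) {
        cudaSetDevice(chosen);
        std::printf("Using device %d for the background kernel\n", chosen);
    } else {
        std::printf("Every device has a run-time limit; all may be driving a display\n");
    }
    return 0;
}
```

If every device reports a run-time limit, there is probably no display-free GPU in the system, and the preemption overhead described above applies whichever device is picked.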

Another alternative might be to break your work into pieces and launch the “background” kernel for short durations. This might be better than letting the kernel run continuously and depending on preemption to sort it out. I don’t know whether this would actually be any better; you might have to experiment with it.
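
A minimal sketch of that chunked approach, under some assumptions: the kernel `backgroundChunk`, its toy summation workload, the chunk size, and the 1 ms pause between launches are all hypothetical stand-ins, not taken from the attached code. The point is only the pattern: keep the state in device memory, launch a short single-warp kernel, synchronize, and give the GPU a moment before the next piece.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical background work: each launch adds the integers in
// [begin, begin + steps) to a running total kept in device memory.
__global__ void backgroundChunk(unsigned long long *state,
                                unsigned long long begin,
                                unsigned long long steps) {
    if (threadIdx.x == 0) {
        unsigned long long acc = *state;
        for (unsigned long long i = begin; i < begin + steps; ++i) {
            acc += i;
        }
        *state = acc;
    }
}

int main() {
    const unsigned long long totalSteps = 1ULL << 24;  // whole job
    const unsigned long long chunkSteps = 1ULL << 16;  // work per launch (tunable)

    unsigned long long *dState = nullptr;
    cudaMalloc(&dState, sizeof(*dState));
    cudaMemset(dState, 0, sizeof(*dState));

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    for (unsigned long long begin = 0; begin < totalSteps; begin += chunkSteps) {
        // One block, one warp, as in the original question.
        backgroundChunk<<<1, 32, 0, stream>>>(dState, begin, chunkSteps);
        cudaStreamSynchronize(stream);
        // Leave the GPU idle briefly so display work can run uncontested.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    unsigned long long result = 0;
    cudaMemcpy(&result, dState, sizeof(result), cudaMemcpyDeviceToHost);
    const unsigned long long expected = totalSteps * (totalSteps - 1) / 2;
    std::printf("result %s the closed-form sum\n",
                result == expected ? "matches" : "does NOT match");

    cudaStreamDestroy(stream);
    cudaFree(dState);
    return 0;
}
```

Whether this is actually gentler on the display than one long-running kernel would have to be measured; the chunk size and the pause length are the two knobs to experiment with.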