I am trying to implement a kernel that uses only a small part of the GPU. Specifically, the kernel is launched in a separate stream with a single block consisting of a single warp. I hoped the kernel could simply “run in the background”, since it occupies only one warp slot on one SM and leaves the rest of the GPU free to use. The reason is that this kernel takes a very long time to compute (its own runtime does not matter here), and it would otherwise impede other kernels that have high priority and are somewhat time-critical.
However, contrary to my expectation, running the kernel heavily degrades the performance of the whole system, as if it were taking up all resources (even the window manager has a hard time).
Why does it behave like that, and is there a better approach to solve this issue?
Example code is attached: main.cu (3.2 KB)
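For readers who cannot open the attachment, here is a minimal sketch of the setup I described. It is not the attached main.cu: the kernel names (`backgroundKernel`, `foregroundKernel`), the helper `blocksFor`, the placeholder arithmetic, and the iteration counts are all made up for illustration. It only shows the launch pattern, one long-running kernel with a single block of 32 threads in its own stream, next to short kernels in another stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical long-running kernel: one block, one warp, placeholder work.
__global__ void backgroundKernel(float *out, long long iterations)
{
    float acc = 0.0f;
    for (long long i = 0; i < iterations; ++i) {
        acc = acc * 0.999f + 1.0f;  // dependent chain keeps the warp busy
    }
    out[threadIdx.x] = acc;         // store the result so the loop is not optimized away
}

// Hypothetical short, time-critical kernel.
__global__ void foregroundKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1 << 20;
    float *bgOut = nullptr, *fgData = nullptr;
    cudaMalloc(&bgOut, 32 * sizeof(float));
    cudaMalloc(&fgData, n * sizeof(float));
    cudaMemset(fgData, 0, n * sizeof(float));

    // Two independent, non-blocking streams.
    cudaStream_t bgStream, fgStream;
    cudaStreamCreateWithFlags(&bgStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&fgStream, cudaStreamNonBlocking);

    // Background work: a single block with a single warp (32 threads).
    backgroundKernel<<<1, 32, 0, bgStream>>>(bgOut, 1LL << 28);

    // Foreground work: many short kernels that should not be delayed.
    for (int rep = 0; rep < 100; ++rep) {
        foregroundKernel<<<blocksFor(n, 256), 256, 0, fgStream>>>(fgData, n);
    }

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(bgStream);
    cudaStreamDestroy(fgStream);
    cudaFree(bgOut);
    cudaFree(fgData);
    return err == cudaSuccess ? 0 : 1;
}
```

On my understanding, the background kernel should need only one warp slot on one SM, which is why I expected the foreground kernels and the desktop to remain responsive.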
Thanks in advance!