Scheduling of kernels

Hi all,

I have a question regarding the scheduling of kernels.

I have two kernels, A and B.
Kernel A is small and occupies only a few streaming multiprocessors (SMs).
Kernel B is large and requires more SMs than the GPU provides.

I would like to colocate kernels A and B, i.e. have them run concurrently on the GPU. The two kernels run in different CUDA streams.
By giving higher priority to the stream that runs kernel A, the two kernels colocate most of the time, as I can see in the NVIDIA Nsight trace.
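
For reference, the setup resembles the minimal CUDA sketch below. It is not the actual PyTorch code: kernelA, kernelB, the grid sizes, and the iteration count are hypothetical stand-ins for the real cuDNN kernels, chosen only to illustrate a small high-priority kernel and a large low-priority kernel in separate streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for kernel A: cheap work, launched with few blocks.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Hypothetical stand-in for kernel B: heavier work, launched with many blocks.
__global__ void kernelB(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

int main() {
    // In CUDA, a numerically lower value means a higher stream priority.
    int leastPriority = 0, greatestPriority = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority));

    cudaStream_t streamA, streamB;
    CHECK(cudaStreamCreateWithPriority(&streamA, cudaStreamNonBlocking,
                                       greatestPriority));  // kernel A: high
    CHECK(cudaStreamCreateWithPriority(&streamB, cudaStreamNonBlocking,
                                       leastPriority));     // kernel B: low

    const int threads = 256;
    const int nA = 4 * threads;   // 4 blocks: fits on a few SMs
    const int nB = 1 << 24;       // 65536 blocks: far more than fit at once
    const int iterations = 100;   // illustrative iteration count

    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, nA * sizeof(float)));
    CHECK(cudaMalloc(&b, nB * sizeof(float)));
    CHECK(cudaMemset(a, 0, nA * sizeof(float)));
    CHECK(cudaMemset(b, 0, nB * sizeof(float)));

    // Each stream launches the same kernel repeatedly.
    for (int it = 0; it < iterations; ++it) {
        kernelB<<<nB / threads, threads, 0, streamB>>>(b, nB);
        kernelA<<<nA / threads, threads, 0, streamA>>>(a, nA);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    CHECK(cudaStreamDestroy(streamA));
    CHECK(cudaStreamDestroy(streamB));
    return 0;
}
```

Built with nvcc and profiled with Nsight Systems, a program of this shape should show the two streams as separate rows in the timeline, which is where I check whether the kernels overlap.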

However, at some points, marked in red in the attached screenshot, the kernels do not colocate.
Does anyone have hints on why this might be happening, or how I could investigate it further?

Some info about the experiment and the attached screenshot:

  • The kernels are taken from PyTorch programs. For testing purposes, each stream launches the same kernel repeatedly for a specified number of iterations.

  • In the screenshot, the kernel ‘void cudnn…’ corresponds to kernel A (the small, high-priority one), while the kernel ‘volta_scudnn…’ corresponds to kernel B (the large, low-priority one).

  • The GPU is a V100 with 16 GB of memory.

Thank you!