Multiple kernel parallel execution / GPU scheduling policy

As far as I know, kernel functions time-share a GPU even when a single kernel cannot occupy all SMs and the remaining SMs could be allocated to one or more other kernels.

Are there any functionalities in CUDA to run multiple kernel functions on different SMs of one GPU simultaneously, rather than having each kernel run in its own time slice?

I’ve tried MPS on a GTX 1080 Ti and saw a small improvement in GPU run time. But I cannot get any information about the SM occupancy of each kernel, so I cannot confirm that the kernels occupy different SMs and run simultaneously.

Is there any document or paper describing the GPU scheduling policy?

I’m not aware of one.

If you want the best possibility of overlapping kernels, launch them from a single process, in separate non-default streams (kernels issued to the same stream serialize). If you want kernels from separate processes to overlap, MPS is necessary.
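As a minimal sketch of the single-process case: two kernels issued to different non-default streams have no implicit ordering between them, so the hardware *may* execute them concurrently if SM resources allow (it is not guaranteed). `kernelA`, `kernelB`, and the array size are placeholder names for illustration, not from the thread above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two independent placeholder kernels, each touching its own buffer.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches into different streams: no ordering dependency, so the
    // GPU scheduler is free to run them at the same time if it can.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    printf("done\n");
    return 0;
}
```

Whether overlap actually occurs still depends on each kernel's resource footprint; a grid large enough to fill every SM leaves nothing for the other stream.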

A profiler such as Nsight Systems can identify overlap of kernels from separate processes.

Do the overlapped kernels run in parallel, or in time-sliced round-robin fashion?

In that case, there is no time-slicing. Whether they run at the same time or not is a function of the capacity of the GPU and the resource utilization of the kernels.
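One way to reason about that capacity question from code is the occupancy API: `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many blocks of a given kernel can be resident per SM, and comparing that against the SM count tells you whether a launch can fill the GPU. This is a hedged sketch; `myKernel` and the block size of 256 are placeholders, and the printed numbers depend on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage we want to inspect.
__global__ void myKernel(float *x) {
    if (x) x[threadIdx.x] = 0.0f;
}

int main() {
    int maxBlocksPerSM = 0;
    // Maximum resident blocks of myKernel per SM, assuming 256 threads
    // per block and no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, myKernel, 256, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // A grid smaller than this cannot fill the GPU, leaving capacity on
    // which another kernel could run at the same time.
    int blocksToFillGPU = maxBlocksPerSM * prop.multiProcessorCount;
    printf("blocks per SM: %d, blocks to fill GPU: %d\n",
           maxBlocksPerSM, blocksToFillGPU);
    return 0;
}
```

This only bounds what the hardware *could* co-schedule; the actual placement of blocks onto SMs is still decided by the undocumented scheduler the question asks about.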

Greetings, I’ve found a paper that proposes a model of the CUDA scheduling policy and may serve as a useful reference.

Here is the paper’s bibliographic information and open-access download link:

Amert, Tanya, et al. “GPU scheduling on the NVIDIA TX2: Hidden details revealed.” 2017 IEEE Real-Time Systems Symposium (RTSS). IEEE, 2017.