Limit the number of (or allocate) SMs on a per-stream basis

I have a real-time application that uses a number of CUDA streams. The large majority of the compute is handled by long-running, low-priority streams whose kernels require a large number of blocks to be processed. This part is highly optimised and keeps the GPU busy. Alongside it, I have a high-priority stream that requires many iterations of much smaller GPU kernels, together with a significant amount of host-side compute to work out what to schedule on the GPU.
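For context, the streams are set up with CUDA's stream-priority API. A minimal sketch of that setup (variable names are illustrative and error checking is omitted):

```cpp
#include <cuda_runtime.h>

int main() {
    // Query the priority range supported by the device.
    // Note: numerically lower values mean higher priority.
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Long-running bulk work goes into a low-priority stream,
    // the latency-sensitive work into a high-priority stream.
    cudaStream_t lowPrioStream, highPrioStream;
    cudaStreamCreateWithPriority(&lowPrioStream,  cudaStreamNonBlocking, leastPriority);
    cudaStreamCreateWithPriority(&highPrioStream, cudaStreamNonBlocking, greatestPriority);

    // ... launch kernels into the two streams ...

    cudaStreamDestroy(lowPrioStream);
    cudaStreamDestroy(highPrioStream);
    return 0;
}
```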

The problem is that although the overall compute, both GPU and CPU, is well within budget, I often experience overrun issues because the high-priority work fails to complete in time. Running the low-priority work first and the high-priority work afterwards is not an option because of the host-side compute; I want to hide that host compute while the low-priority GPU kernels run.

Analysing the profile generated by nvprof shows that the high-priority stream is not being scheduled between kernels on the low-priority streams as often as expected. Said differently, when the GPU is fully populated by (multiple) kernels from the low-priority streams, that work is not always preempted as expected.

The current solution we have is to break the kernels on the low-priority streams into chunks small enough that a single launch does not fully populate the GPU. We then see the high-priority work being scheduled as soon as it is ready, but I suspect that another low-priority stream would also get to run in the gap, since the GPU has spare capacity.
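To illustrate that workaround, the sketch below caps each low-priority launch at fewer blocks than the GPU has SMs and loops over the work in chunks. The kernel, the data layout, and the reserve of 4 SMs are all placeholders for our real code, not a recommendation:

```cpp
#include <cuda_runtime.h>

// Placeholder for one of the real low-priority kernels: processes a slice of the data.
__global__ void lowPrioKernel(float *data, int offset, int n) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // dummy work
}

void launchLowPriorityChunks(float *d_data, int n, cudaStream_t lowPrioStream) {
    const int threads = 256;

    // Leave headroom: cap each launch at (SM count - reserve) blocks so the GPU is
    // never fully populated by this stream. The reserve of 4 is illustrative only.
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    int maxBlocksPerLaunch = smCount > 4 ? smCount - 4 : 1;

    for (int offset = 0; offset < n; offset += maxBlocksPerLaunch * threads) {
        int remaining = n - offset;
        int blocks = (remaining + threads - 1) / threads;
        if (blocks > maxBlocksPerLaunch) blocks = maxBlocksPerLaunch;
        lowPrioKernel<<<blocks, threads, 0, lowPrioStream>>>(d_data, offset, n);
    }
}
```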

Why doesn't the GPU schedule the high-priority stream between kernels on the low-priority streams as expected? (The exact comment in the documentation is "Work in a higher priority stream may preempt work already executing in a low priority stream." Why "may" and not "always"?)

A much cleaner solution than breaking all the kernels into smaller chunks would be to limit or specify the number (and range) of SMs available to a given stream, leaving the remaining SMs free to be used by another stream as soon as work arrives. For example, on a 1080 Ti I'd like to allocate 24 SMs to the low-priority streams and leave 4 SMs for the high-priority stream.

Is there a way to achieve this without using persistent threads/kernels?


This is a very important scenario; I have exactly the same one.

Does anyone have information on this problem, i.e. how to force high-priority stream preemption or how to allocate SMs per stream?

Did you find a way?

Why is NVIDIA not addressing this? It could be very useful for real-time schedulers. AMD has a similar mechanism for streams, but NVIDIA, the giant of deep learning and GPGPU, does not support it!

The partitioning of SMs can be achieved using MPS resource provisioning. This implies that the work be broken into separate processes. For a single process, the only methodologies are the ones already mentioned, the primary one being stream priorities. (Another effective method is probably to use two or more GPUs.) I have made suggestions here about how to use stream priorities to give best progress to the high-priority stream. I don't have any further suggestions. It's quite possible these suggestions don't address every case.
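To make the MPS route a little more concrete, here is a hedged sketch of how a client process could cap its SM share under MPS execution resource provisioning (a feature of Volta-and-later MPS). The 15% figure is purely illustrative, the environment variable must be in place before the process creates its CUDA context, and setting it in the launching shell (with the nvidia-cuda-mps-control daemon already running) achieves the same thing:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Illustrative provisioning choice: cap this client process at ~15% of the SMs
    // so the rest stay available to the other (low-priority) client process.
    // Must be set before the first CUDA call so the MPS client picks it up.
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "15", /*overwrite=*/1);

    cudaFree(0);  // force CUDA context creation under the provisioning limit

    // ... launch the latency-sensitive kernels from this separate process ...

    return 0;
}
```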