I have a real-time application which uses a number of CUDA streams. The large majority of the compute is handled by long-running, low-priority streams whose kernels require a large number of blocks; this part is highly optimised and keeps the GPU busy. Alongside this, I have a high-priority stream that runs many iterations of much smaller GPU kernels, with a significant amount of host-side compute needed between iterations to work out what to schedule on the GPU.
The problem is that although the overall compute, both GPU and CPU, is well within budget, I often see overrun issues because the high-priority work fails to complete in time. Running the low-priority work first and then the high-priority work is not an option because of the host-side compute; I want to hide that host compute whilst the low-priority GPU kernels run.
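For context, the streams are set up along these lines (a minimal sketch, not our actual code; kernel names, grid sizes, and iteration counts are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void bigKernel(float *d, int n)   { /* placeholder: long-running low-priority work */ }
__global__ void smallKernel(float *d, int n) { /* placeholder: one high-priority iteration */ }

int main() {
    int leastPriority, greatestPriority;
    // Query the valid priority range; numerically lower values mean higher priority.
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t lowPrio, highPrio;
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);

    float *d; int n = 1 << 20;
    cudaMalloc(&d, n * sizeof(float));

    // Long-running kernel with many blocks: enough to fully populate the GPU.
    bigKernel<<<4096, 256, 0, lowPrio>>>(d, n);

    // Many iterations of small high-priority kernels, each gated by host-side work.
    for (int i = 0; i < 100; ++i) {
        // ... host-side compute to decide what to launch next ...
        smallKernel<<<4, 256, 0, highPrio>>>(d, n);
    }

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```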
Analysing the profile generated by nvprof shows that the high-priority stream is not being scheduled between kernels on the low-priority stream as often as expected. Said differently, when the GPU is fully populated by (multiple) kernels from the low-priority stream, that work is not always preempted as expected.
The current solution we have is to break the kernels on the low-priority streams into launches small enough that they do not fully populate the GPU. We then see the high-priority work being scheduled as soon as it is ready, but I suspect that another low-priority stream would also get to run, since the GPU has spare capacity.
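Concretely, the workaround replaces one large launch with a sequence of small ones so that SMs drain between launches (a sketch; the chunk size and the chunked kernel variant are illustrative, not our real code):

```cuda
// Instead of a single launch that fills the GPU, e.g.
//   bigKernel<<<4096, 256, 0, lowPrio>>>(d, n);
// issue the same work as many small launches on the low-priority stream:
const int totalBlocks = 4096;
const int chunkBlocks = 16;  // illustrative: small enough not to occupy every SM
for (int offset = 0; offset < totalBlocks; offset += chunkBlocks) {
    // Hypothetical chunked variant of the kernel that takes a block offset
    // so each launch processes its slice of the original grid.
    bigKernelChunk<<<chunkBlocks, 256, 0, lowPrio>>>(d, n, offset);
}
```

Between any two of these small launches, pending high-priority kernels get picked up, which is the behaviour we observe in the profile.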
Why doesn't the GPU schedule the high-priority stream between kernels on the low-priority streams as expected? (The exact wording in the documentation is "Work in a higher priority stream may preempt work already executing in a low priority stream." Why "may" and not "always"?)
A much cleaner solution than breaking up all the kernels into smaller chunks would be to limit, or specify, the number (and range) of SMs available to a given stream, leaving the remaining SMs free for another stream as soon as its work arrives. For example, on a 1080 Ti I'd like to allocate 24 SMs to the low-priority streams and leave 4 SMs for the high-priority stream.
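To be explicit about the kind of API I'm after (this is hypothetical; I'm not aware of any such call in the CUDA runtime):

```cuda
// Hypothetical, NOT a real CUDA API -- illustrating the desired behaviour only.
// cudaStreamSetSmRange(stream, firstSm, smCount);
cudaStreamSetSmRange(lowPrio,  0, 24);   // low-priority streams confined to SMs 0-23
cudaStreamSetSmRange(highPrio, 24, 4);   // SMs 24-27 kept free for high-priority work
```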
Is there a way to achieve this without using persistent threads/kernels?