Limit the number of (or allocate) SMs on a per-stream basis

I have a real-time application that uses a number of CUDA streams. The large majority of the compute is handled by long-running low-priority streams whose kernels require a large number of blocks to be processed. This is highly optimised and keeps the GPU busy. Alongside this, I have a high-priority stream that runs many iterations of much smaller GPU kernels, together with a significant amount of host-side compute needed to work out what to schedule on the GPU.
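For reference, the stream setup described above can be sketched roughly as follows. This is a minimal sketch, not my actual code: the helper name `createPriorityStreams` is hypothetical, and it creates only one stream of each kind. Note that in CUDA a numerically lower value means a higher priority.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: creates one low-priority and one high-priority stream.
// Numerically LOWER values mean HIGHER priority in the CUDA runtime.
cudaError_t createPriorityStreams(cudaStream_t* lowStream, cudaStream_t* highStream)
{
    int leastPriority = 0;     // numerically largest value, lowest priority
    int greatestPriority = 0;  // numerically smallest value, highest priority
    cudaError_t err = cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    if (err != cudaSuccess) return err;

    // Non-blocking streams do not synchronise implicitly with the default stream.
    err = cudaStreamCreateWithPriority(lowStream, cudaStreamNonBlocking, leastPriority);
    if (err != cudaSuccess) return err;

    err = cudaStreamCreateWithPriority(highStream, cudaStreamNonBlocking, greatestPriority);
    if (err != cudaSuccess) {
        // Clean up the first stream if the second could not be created.
        cudaStreamDestroy(*lowStream);
        *lowStream = nullptr;
    }
    return err;
}
```

The large kernels are launched on the low-priority stream(s) and the small, host-driven kernels on the high-priority one.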

The problem is that although the overall compute, both GPU and CPU, is well within budget, I often experience overruns because the high-priority stream fails to complete in time. Running the low-priority work first and then the high-priority work is not an option because of the host compute: I want to hide that host compute whilst the low-priority GPU kernels run.

Analysing the profile generated by nvprof shows that the high-priority stream is not being scheduled between kernels on the low-priority stream as often as expected. Put differently, when the GPU is fully populated by (multiple) kernels on the low-priority stream, that work is not always preempted as expected.

The current solution is to break the kernels on the low-priority streams into chunks small enough that they do not fully populate the GPU. With that change, the high-priority work is scheduled as soon as it is ready, but I suspect that another low-priority stream could equally get to run, as the GPU has spare capacity.
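The chunking workaround can be sketched along these lines. This is an illustrative sketch under assumptions, not the real code: `lowPriorityKernel` is a hypothetical stand-in for one of the large kernels, `reservedSMs` is a made-up tuning parameter, and the headroom estimate only limits how many blocks are resident at once. It does not pin blocks to particular SMs, since the hardware scheduler decides placement.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical stand-in for one of the long-running low-priority kernels.
// blockOffset lets several small launches cover the same logical grid.
__global__ void lowPriorityKernel(float* data, int n, int blockOffset)
{
    int block = (int)blockIdx.x + blockOffset;
    int i = block * (int)blockDim.x + (int)threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Pure host arithmetic: number of launches needed to cover totalBlocks.
inline int chunkCount(int totalBlocks, int blocksPerChunk)
{
    return (totalBlocks + blocksPerChunk - 1) / blocksPerChunk;
}

// Estimates a chunk size that leaves roughly reservedSMs worth of
// resident-block capacity unused (an assumption-laden heuristic).
cudaError_t blocksPerChunkWithHeadroom(int threadsPerBlock, int reservedSMs,
                                       int* blocksPerChunk)
{
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    int smCount = 0;
    err = cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;

    int blocksPerSM = 0;  // max resident blocks of this kernel per SM
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, lowPriorityKernel, threadsPerBlock, 0);
    if (err != cudaSuccess) return err;

    int usableSMs = std::max(1, smCount - reservedSMs);
    *blocksPerChunk = std::max(1, usableSMs * blocksPerSM);
    return cudaSuccess;
}

// Issues the work as a sequence of smaller launches on one stream.
cudaError_t launchInChunks(float* data, int n, int threadsPerBlock,
                           int blocksPerChunk, cudaStream_t stream)
{
    int totalBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int offset = 0; offset < totalBlocks; offset += blocksPerChunk) {
        int blocks = std::min(blocksPerChunk, totalBlocks - offset);
        lowPriorityKernel<<<blocks, threadsPerBlock, 0, stream>>>(data, n, offset);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

Chunks issued to the same stream run one after another, so this caps what a single low-priority stream occupies; it does nothing to stop a second low-priority stream from filling the headroom, which is exactly the concern above.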

Why doesn't the GPU schedule the high-priority stream between kernels on the low-priority streams as expected? The exact comment is "Work in a higher priority stream may preempt work already executing in a low priority stream." Why "may" and not "always"?

A much cleaner solution than breaking all the kernels into smaller chunks would be to limit or specify the number (and range) of SMs available to a given stream, leaving some SMs unused and free to be used by another stream as soon as work arrives. For example, on a GTX 1080 Ti (28 SMs), I'd like to allocate 24 SMs to the low-priority streams and leave 4 SMs for the high-priority stream.

Is there a way to achieve this without using persistent threads/kernels?