How to verify that a high priority stream is served

Hi,
I’m using a GeForce RTX 3050 for real-time signal processing (CUDA 12.0), and Nsight Systems 2022 for profiling.

I have two CPU threads running two CUDA streams.

The high priority stream (A) receives input every 10 ms and performs its work in 3 ms.

The low priority stream (B) receives input every 1000 ms and performs its work in 300 ms using several kernels.

The problem is that when the low priority stream (B) is launched, it blocks the high priority stream (A) for the entire 300 ms that (B) is running.

I tried using CUDA stream priorities - it does help.
What I’m looking for is something similar to preemption.

How do I make sure that every input in (A) is served within a few milliseconds of latency?
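For reference, this is roughly how I set up the two streams with priorities (a simplified sketch; variable names are illustrative):

    #include <cuda_runtime.h>

    cudaStream_t streamA;   // high priority: real-time input every 10 ms
    cudaStream_t streamB;   // low priority: batch input every 1000 ms

    void createStreams()
    {
        int leastPriority = 0, greatestPriority = 0;
        // Query the priority range supported by this device.
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

        // Numerically lower values mean higher priority.
        cudaStreamCreateWithPriority(&streamA, cudaStreamNonBlocking, greatestPriority);
        cudaStreamCreateWithPriority(&streamB, cudaStreamNonBlocking, leastPriority);
    }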

I do get the expected “context switch” from stream (B) to stream (A) when I add “stream synchronize” calls between the stream (B) kernel launches.
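The workaround looks roughly like this (again a simplified sketch; the kernel names are placeholders for the real stream (B) processing stages):

    __global__ void stageOne(float* buf, int n) { /* ... part of B's work ... */ }
    __global__ void stageTwo(float* buf, int n) { /* ... part of B's work ... */ }

    void processLowPriorityInput(float* d_buf, int n, cudaStream_t streamB)
    {
        dim3 block(256), grid((n + 255) / 256);

        stageOne<<<grid, block, 0, streamB>>>(d_buf, n);
        cudaStreamSynchronize(streamB);   // gives stream A's pending work a chance to run ...

        stageTwo<<<grid, block, 0, streamB>>>(d_buf, n);
        cudaStreamSynchronize(streamB);   // ... but stalls the CPU thread and serializes B
    }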

But this is not a good solution:

  1. stream (B) loses its asynchronous execution behavior
  2. I cannot (and do not want to) insert a “stream synchronize” call in (B) every 10 ms

Any other ideas?

Stream priority is not guaranteed to use preemption.

Some ideas:

  1. Design the code of the kernels in stream B so that blocks are constantly cycling - this opens up block scheduling space, so blocks from stream A can be scheduled in a priority fashion. In a nutshell, you do this by using the opposite of grid-stride loops: make the number of blocks in your kernel large and have each block do a small amount of work. You improve the response latency for stream A by making the duration of blocks in stream B short (a rough sketch follows this list).

  2. Or alternatively, design the code for kernels in stream B such that they do not use full occupancy on the GPU in question. Underutilize the GPU in stream B, so that when kernels in stream A come along, there is “empty space” waiting for them.

  3. Or alternatively, use an MPS server, and use its facilities to limit resources for the client running stream B. This involves a more radical refactoring, because you will now need multiple clients and therefore probably inter-process communication. The net effect is the same as item 2: you are leaving “empty space” on the GPU so that blocks from stream A can be deposited more or less immediately.
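To make item 1 concrete, here is a rough sketch (the per-element work is a placeholder, not drop-in code). The first kernel is the usual grid-stride style with a few long-lived blocks; the second does the opposite, one small slice of work per block, so blocks retire quickly and the block scheduler regularly has free slots for blocks from stream A:

    __device__ float someWork(float x) { return 2.0f * x + 1.0f; }   // placeholder per-element work

    // Grid-stride style: a modest number of blocks, each looping over a large
    // share of the data. Blocks stay resident for most of the kernel's duration,
    // so there are few opportunities to slot in higher priority blocks.
    __global__ void kernelB_gridStride(float* data, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] = someWork(data[i]);
    }

    // "Opposite of grid-stride": one element per thread, many short-lived blocks.
    // Blocks are constantly retiring, so the scheduler can frequently place
    // blocks from the high priority stream as soon as they arrive.
    __global__ void kernelB_manySmallBlocks(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = someWork(data[i]);
    }

    // Launch sketch (streamB is the low priority stream):
    //   kernelB_gridStride     <<< 8 * numSMs     , 256, 0, streamB>>>(d_data, n);
    //   kernelB_manySmallBlocks<<< (n + 255) / 256, 256, 0, streamB>>>(d_data, n);

For item 2, the complementary approach, the same grid-stride kernel could simply be launched with fewer blocks than it takes to fill the GPU, deliberately leaving SMs idle for stream A.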

Coupling a more powerful GPU with one or more of the above suggestions may also help. Similarly, using two GPUs would most likely give an improvement in processing.


Thanks Robert for your answer.
The solution of using a stronger GPU or multiple GPUs is fine for some systems, but my smallest system has a single, weak GPU.

The question of priority/preemption is very important for me, as we will add additional tasks to the system

Can you explain what I should expect in the following case:
I start a low priority kernel (B) with 1000 blocks, and shortly afterwards the high priority kernel (A) starts.

Does the GPU only finish the currently executing blocks of (B), or does it have to finish all 1000 blocks of (B) before preemption?

Can you point me to a good resource or tutorial that explains the design/philosophy of running multiple kernels concurrently and with priorities on NVIDIA GPUs?

Sorry, I don’t know of a resource that focuses strictly on that.

First of all, I wouldn’t “expect” preemption. It’s not guaranteed. That is why the word “may” appears. I don’t know all the cases where preemption may or may not get used, but one case where I’m fairly confident it won’t be used is cc5.x devices or below. And I won’t be able to answer further questions about “what about this case or that?” It’s not all spelled out in the CUDA documentation. If you’d like to see a change in the CUDA documentation, please file a bug. The instructions for doing so are linked in a sticky post in this forum.

Next, I think it’s not sensible to assume that preemption makes everything work perfectly for any arbitrary code and set of requirements. Preemption is not something that can happen in a nanosecond or a picosecond. It involves something like a context switch on the GPU, which is an expensive process. Therefore I doubt that the CUDA designers simply allow preemption to happen in any circumstance. There are probably some conditions under which it will occur and others under which it might not occur, or might not occur as quickly as you expect.

Regarding what to do, my suggestion remains the same. In the worst case, without preemption, you are left with “cooperative” scheduling by the CWD (CUDA work distributor), i.e. the block scheduler. Even without preemption, the block scheduler should prefer blocks from higher priority kernels over blocks from lower priority kernels. Therefore, to make this happen as smoothly as possible, make block scheduling decisions happen as often as possible. This means constructing your low priority kernels in such a way that they launch many blocks to get the work done, where each block has a small resource footprint and a short duration.

I don’t have any further advice or insight beyond that.

Robert, I truly appreciate your answers

As you explained, I am not expecting preemption within a running block.
I do want to understand the block scheduling behavior.

All my kernels use many small blocks.
The CPU runs two threads driving two streams with priorities.

Still, in most scenarios, the low priority work (kernels + async memory operations) makes the high priority work (kernels + async memory operations) wait for the low priority completion
(profiling with Nsight Systems 2022).

Strange

I understand that the GPU thread block scheduler only partially supports (or perhaps does not support) real-time applications.

The bottom line is that we would like “short execution, high priority” streams to be executed fully and with minimal latency, postponing the “long execution, low priority” streams - and this is not the case on a GPU.

The GPU thread block scheduler uses a round-robin policy, and any implicit or explicit synchronization may (or will) cause the scheduler to stop the current high priority stream and switch to low priority streams.

Maybe MPS (Linux only) can resolve this issue (and only between processes) - I haven’t checked it.

For real-time developers using GPUs, I found this resource as an introduction to the topic:


Hello, I have two questions. What does “constantly cycling” mean in idea 1? And is there more detailed information about ideas 1 and 2?

I explained what it meant:

This means that each block runs for a relatively short period of time, giving the block scheduler many opportunities to schedule blocks from higher priority streams, if needed.

I don’t have anything further to share.

I have two questions about priority streams in MPS @Robert_Crovella

  1. If kernels A and B are launched in a low priority stream and kernel A is currently running, and meanwhile kernels C and D are launched in a high priority stream, what is the finishing order?
  2. If kernels A and B are launched in a high priority stream and kernel A (a very small kernel) is currently running, and meanwhile kernel C is launched in a low priority stream, can kernel A and kernel C run at the same time?

For applications where you are not happy with the (non-)guarantees or behaviour of the scheduler, you can try to emulate some of it manually. For example: a flexible kernel that can do any kind of work A, B, C, D depending on a work package in memory; or not enqueueing the kernels ahead of time, but making the decision of which kernel to launch at the last microsecond on the CPU; or keeping a work kernel continually running on the GPU.

Those methods also have some overhead or added latency, but you can tune them as you need for the application.
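A very rough sketch of the “continually running work kernel” variant, assuming a single work slot in pinned, mapped host memory (the flag protocol, names, and error handling here are simplified and purely illustrative):

    #include <cuda_runtime.h>

    struct WorkSlot {
        volatile int request;   // 0 = idle, >0 = task id to run, -1 = shut down
        volatile int done;      // set by the GPU when the requested work is finished
    };

    __global__ void workServer(WorkSlot* slot, float* buf, int n)
    {
        while (true) {
            int req;
            do { req = slot->request; } while (req == 0);   // all threads poll the flag
            if (req < 0) return;                            // shutdown requested

            // Dispatch on the task id written by the host (two toy "tasks" here).
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                buf[i] = (req == 1) ? buf[i] * 2.0f : buf[i] + 1.0f;

            __syncthreads();                                // whole block finished this request
            if (threadIdx.x == 0) {
                __threadfence_system();                     // make results visible to the host
                slot->request = 0;
                slot->done    = 1;
            }
            __syncthreads();
        }
    }

    // Host-side usage sketch:
    //   WorkSlot* slot;
    //   cudaHostAlloc((void**)&slot, sizeof(WorkSlot), cudaHostAllocMapped);
    //   slot->request = 0; slot->done = 0;
    //   workServer<<<1, 256>>>(slot, d_buf, n);            // launched once, stays resident
    //   slot->done = 0;  slot->request = 1;                // submit task 1
    //   while (slot->done == 0) { /* spin or sleep */ }    // wait for completion
    //   slot->request = -1;  cudaDeviceSynchronize();      // shut down the server kernel

The decision of what runs and when is then made entirely by your own host/device protocol rather than by the stream scheduler, at the cost of keeping a kernel resident and polling.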

None of my comments here in this thread about priority streams pertain to MPS. In my comments here about MPS, I said it was an alternative technology to consider. In my usage there, alternative means “separate and different”.