How to verify that a high priority stream is served

Hi,
I’m using a GeForce RTX 3050 for real-time signal processing (CUDA 12.0), and Nsight Systems 2022 for profiling.

I have two CPU threads running two CUDA streams.

The high priority stream (A) receives input every 10 ms and performs its work in 3 ms.

The low priority stream (B) receives input every 1000 ms and performs its work in 300 ms using several kernels.

The problem is that when the low priority stream (B) is launched, it blocks the high priority stream (A) for the entire 300 ms that (B) is running.

I tried using CUDA stream priorities - it does help.
What I’m looking for is something similar to preemption.

How do I make sure that every input in (A) is served within a few milliseconds of latency?
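For reference, this is roughly how I set up the two streams with priorities (a simplified sketch; variable names are illustrative):

    #include <cuda_runtime.h>

    cudaStream_t streamA;   // high priority: real-time input every 10 ms
    cudaStream_t streamB;   // low priority: batch input every 1000 ms

    void createStreams()
    {
        int leastPriority = 0, greatestPriority = 0;
        // Query the priority range supported by this device.
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

        // Numerically lower values mean higher priority.
        cudaStreamCreateWithPriority(&streamA, cudaStreamNonBlocking, greatestPriority);
        cudaStreamCreateWithPriority(&streamB, cudaStreamNonBlocking, leastPriority);
    }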

I do get the expected “context switch” from stream (B) to stream (A) when I add “stream synchronize” calls between the stream (B) kernel launches.
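The workaround looks roughly like this (again a simplified sketch; the kernel names are placeholders for the real stream (B) processing stages):

    __global__ void stageOne(float* buf, int n) { /* ... part of B's work ... */ }
    __global__ void stageTwo(float* buf, int n) { /* ... part of B's work ... */ }

    void processLowPriorityInput(float* d_buf, int n, cudaStream_t streamB)
    {
        dim3 block(256), grid((n + 255) / 256);

        stageOne<<<grid, block, 0, streamB>>>(d_buf, n);
        cudaStreamSynchronize(streamB);   // gives stream A's pending work a chance to run ...

        stageTwo<<<grid, block, 0, streamB>>>(d_buf, n);
        cudaStreamSynchronize(streamB);   // ... but stalls the CPU thread and serializes B
    }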

But this is not a good solution:

  1. stream (B) loses its asynchronous execution behavior
  2. I cannot (and do not want to) insert a “stream synchronize” call in (B) every 10 ms

Any other ideas?

Stream priority is not guaranteed to use preemption.

Some ideas:

  1. Design the code of the kernels in stream B so that blocks are constantly cycling - this opens up block scheduling space, so blocks from stream A can be scheduled in a priority fashion. In a nutshell, you do this by using the opposite of grid-stride loops: make the number of blocks in your kernel large and have each block do a small amount of work. You improve the response latency for stream A by making the duration of blocks in stream B short (a rough sketch follows this list).

  2. Or alternatively, design the code for kernels in stream B such that they do not use full occupancy on the GPU in question. Underutilize the GPU in stream B, so that when kernels in stream A come along, there is “empty space” waiting for them.

  3. Or alternatively, use an MPS server, and use its facilities to limit resources for the client running stream B. This involves a more radical refactoring, because you will now need multiple clients and therefore probably inter-process communication. The net effect is the same as item 2: you are leaving “empty space” on the GPU so that blocks from stream A can be deposited more or less immediately.
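To make item 1 concrete, here is a rough sketch (the per-element work is a placeholder, not drop-in code). The first kernel is the usual grid-stride style with a few long-lived blocks; the second does the opposite, one small slice of work per block, so blocks retire quickly and the block scheduler regularly has free slots for blocks from stream A:

    __device__ float someWork(float x) { return 2.0f * x + 1.0f; }   // placeholder per-element work

    // Grid-stride style: a modest number of blocks, each looping over a large
    // share of the data. Blocks stay resident for most of the kernel's duration,
    // so there are few opportunities to slot in higher priority blocks.
    __global__ void kernelB_gridStride(float* data, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] = someWork(data[i]);
    }

    // "Opposite of grid-stride": one element per thread, many short-lived blocks.
    // Blocks are constantly retiring, so the scheduler can frequently place
    // blocks from the high priority stream as soon as they arrive.
    __global__ void kernelB_manySmallBlocks(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = someWork(data[i]);
    }

    // Launch sketch (streamB is the low priority stream):
    //   kernelB_gridStride     <<< 8 * numSMs     , 256, 0, streamB>>>(d_data, n);
    //   kernelB_manySmallBlocks<<< (n + 255) / 256, 256, 0, streamB>>>(d_data, n);

For item 2, the complementary approach, the same grid-stride kernel could simply be launched with fewer blocks than it takes to fill the GPU, deliberately leaving SMs idle for stream A.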

Coupling a more powerful GPU with one or more of the above suggestions may also help. Similarly, using two GPUs would most likely give an improvement in processing.


Thanks Robert for your answer.
The solution of using a stronger GPU or multiple GPUs is fine for some systems, but my smallest system has a single, weak GPU.

The question of priority/preemption is very important for me, as we will add additional tasks to the system

Can you explain what I should expect in the following case:
I start a low priority kernel (B) with 1000 blocks, and shortly afterwards the high priority kernel (A) starts.

Does the GPU only finish the currently executing blocks of (B), or does it have to finish all 1000 blocks of (B) before preemption?

Can you point me to a good resource or tutorial that explains the design/philosophy of running multiple kernels concurrently and with priorities on NVIDIA GPUs?

Sorry, I don’t know of a resource that focuses strictly on that.

First of all, I wouldn’t “expect” preemption. It’s not guaranteed. That is why the word “may” appears. I don’t know all the cases where preemption may or may not get used, but one case where I’m fairly confident it won’t be used is cc5.x devices or below. And I won’t be able to answer further questions about “what about this case or that?” It’s not all spelled out in the CUDA documentation. If you’d like to see a change in the CUDA documentation, please file a bug. The instructions for doing so are linked in a sticky post in this forum.

Next, I think it’s not sensible to assume that preemption makes everything work perfectly for any arbitrary code and set of requirements. Preemption is not something that can happen in a nanosecond or a picosecond. It involves something like a context switch on the GPU, which is an expensive process. Therefore I doubt that the CUDA designers simply allow preemption to happen in any circumstance. There are probably some conditions under which it will occur and others under which it might not occur, or might not occur as quickly as you expect.

Regarding what to do, my suggestion remains the same. In the worst case, without preemption, you are left with “cooperative” scheduling by the CWD (CUDA work distributor), i.e. the block scheduler. Even without preemption, the block scheduler should prefer blocks from higher priority kernels over blocks from lower priority kernels. Therefore, to make this happen as smoothly as possible, make block scheduling decisions happen as often as possible. This means constructing your low priority kernels in such a way that they launch many blocks to get the work done, where each block has a small resource footprint and a short duration.

I don’t have any further advice or insight beyond that.

Robert, I truly appreciate your answers

As you explained, I am not expecting preemption within a running block.
I do want to understand the block scheduling behavior.

All my kernels use many small blocks.
The CPU runs two threads driving two streams with priorities.

Still, in most scenarios, the low priority work (kernels + async memory operations) makes the high priority work (kernels + async memory operations) wait for the low priority completion
(profiling with Nsight Systems 2022).

Strange

I understand that the GPU thread block scheduler only partially supports (or perhaps does not support) real-time applications.

The bottom line is that we would like “short execution, high priority” streams to be executed fully and with minimal latency, postponing the “long execution, low priority” streams - and this is not the case on a GPU.

The GPU thread block scheduler uses a round-robin policy, and any implicit or explicit synchronization may (or will) cause the scheduler to stop the current high priority stream and switch to low priority streams.

Maybe MPS (Linux only) can resolve this issue (and only between processes) - I haven’t checked it.

For real-time developers using GPUs, I found this resource as an introduction to the topic:


Hello, I have two questions. What does “constantly cycling” mean in idea 1? And is there more detailed information about ideas 1 and 2?

I explained what it meant:

This means that each block runs for a relatively short period of time, giving the block scheduler many opportunities to schedule blocks from higher priority streams, if needed.

I don’t have anything further to share.

I have two questions about priority streams in MPS @Robert_Crovella

  1. If kernels A and B are launched in a low priority stream and kernel A is currently running, and meanwhile kernels C and D are launched in a high priority stream, what is the finishing order?
  2. If kernels A and B are launched in a high priority stream and kernel A (a very small kernel) is currently running, and meanwhile kernel C is launched in a low priority stream, can kernel A and kernel C run at the same time?

For applications where you are not happy with the (non-)guarantees or behaviour of the scheduler, you can try to emulate some of it manually. For example: a flexible kernel that can do any kind of work A, B, C, D depending on a work package in memory; or not enqueueing the kernels ahead of time, but making the decision of which kernel to launch at the last microsecond on the CPU; or keeping a work kernel continually running on the GPU.

Those methods also have some overhead or added latency, but you can tune them as you need for the application.
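A very rough sketch of the “continually running work kernel” variant, assuming a single work slot in pinned, mapped host memory (the flag protocol, names, and error handling here are simplified and purely illustrative):

    #include <cuda_runtime.h>

    struct WorkSlot {
        volatile int request;   // 0 = idle, >0 = task id to run, -1 = shut down
        volatile int done;      // set by the GPU when the requested work is finished
    };

    __global__ void workServer(WorkSlot* slot, float* buf, int n)
    {
        while (true) {
            int req;
            do { req = slot->request; } while (req == 0);   // all threads poll the flag
            if (req < 0) return;                            // shutdown requested

            // Dispatch on the task id written by the host (two toy "tasks" here).
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                buf[i] = (req == 1) ? buf[i] * 2.0f : buf[i] + 1.0f;

            __syncthreads();                                // whole block finished this request
            if (threadIdx.x == 0) {
                __threadfence_system();                     // make results visible to the host
                slot->request = 0;
                slot->done    = 1;
            }
            __syncthreads();
        }
    }

    // Host-side usage sketch:
    //   WorkSlot* slot;
    //   cudaHostAlloc((void**)&slot, sizeof(WorkSlot), cudaHostAllocMapped);
    //   slot->request = 0; slot->done = 0;
    //   workServer<<<1, 256>>>(slot, d_buf, n);            // launched once, stays resident
    //   slot->done = 0;  slot->request = 1;                // submit task 1
    //   while (slot->done == 0) { /* spin or sleep */ }    // wait for completion
    //   slot->request = -1;  cudaDeviceSynchronize();      // shut down the server kernel

The decision of what runs and when is then made entirely by your own host/device protocol rather than by the stream scheduler, at the cost of keeping a kernel resident and polling.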

None of my comments here in this thread about priority streams pertain to MPS. In my comments here about MPS, I said it was an alternative technology to consider. In my usage there, alternative means “separate and different”.