Using multi streams in cuda graph, the execution order is uncontrolled

liyuruijx · May 17, 2022, 9:58am

I am using cuda graph stream capture API to implement a small demo with multi streams. Referenced by the CUDA Programming Guide here, I wrote the complete code. In my knowledge, kernelB should execute on stream1, but with nsys I found kernelB is executed on a complete new stream. It is under-control. The scheduling graph is showed below:

Here is my code

#include <iostream>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  cudaGraphExec_t graphExec = NULL;
  cudaEvent_t event1, event2;
  cudaEventCreate(&event1);
  cudaEventCreate(&event2);

  for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
    kernelA<<<512, 512, 0, stream1>>>();
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernelB<<<256, 512, 0, stream1>>>();
    kernelC<<<16, 512, 0, stream2>>>();
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
    cudaStreamEndCapture(stream1, &graph);
    if (graphExec != NULL) {
      cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
      if (graphExec != NULL) {
        cudaGraphExecDestroy(graphExec);
      }
      cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }
    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
  }
}

I am wondering the following questions for cuda graph:

Maybe cuda graph will automatically put independent tasks to different streams, so the procedure is different from what I defined in capture code. Is there any way for me to control the procedure?
Is there some cost analyzations inside cuda graph to select the right task to different stream which will maximum the overall performance? Since different kernel scheduling may lead to different performance especially in deep learning inference or training.

Robert_Crovella · May 17, 2022, 2:42pm

liyuruijx · May 17, 2022, 2:46pm

Hi, thanks for your reply, I just want to know whether there are some cost analyzations inside cuda graph to select the right task to different stream which will maximum the overall performance.

Robert_Crovella · May 17, 2022, 2:54pm

CUDA graphs use streams to arrange for concurrency and asynchrony. You can control dependencies. This control is most obvious if you use the API capture method, but if you use the stream capture method, the dependencies will still be defined at that point.

No graph item will execute before its dependencies are complete. Other than that, CUDA graphs will attempt to schedule work efficiently to maximize performance, and you have no direct control over this scheduling.

Let’s say we have a graph item B that is dependent on A, and a graph item C that is also dependent on A. CUDA graphs will use streams (generally speaking) to allow both B and C to execute as quickly as possible, after A is complete.

Regarding your question 2, you don’t have control over the detailed scheduling of activity, other than declaring dependencies.

liyuruijx · May 17, 2022, 3:08pm

Thanks a lot. I am wondering if there are some available documents about the inside scheduling policy of multi-stream in cuda graph?

Robert_Crovella · May 17, 2022, 3:11pm

Not that I am aware of. The majority of CUDA docs are available here and the primary docs are the programming guide and the Runtime (or Driver) API reference manual. Both cover various topics related to CUDA graphs. There are also CUDA graphs blogs, such as this one and this one.

liyuruijx · May 17, 2022, 3:12pm

Thanks again for your kindness reply.

chenzd15thu · February 11, 2023, 1:10am

@Robert_Crovella
Hi! Thanks for your answers. Is there any method to assign different priorities (like stream priority) to the nodes in the graph? So that the threadblocks of nodes with low priority can only be scheduled when there is no threadblock can be schedule in the high priority nodes.

I actually tried the cudaGraphInstantiateFlagUseNodePriority in the stream capturing mode, but seems that is doesn’t work.

Thanks!

jaydeeppatel_1111 · August 21, 2023, 9:57am

@Robert_Crovella - My understanding of cudaGraphInstantiateFlagUseNodePriority is to prioritize kernel calls.
i.e. we have 3 independent kernels in cudaGraph first, second & third and let’s each kernel waits for 1 s and print its name. If we set kernel graph node priority using cudaGraphKernelNodeSetAttribute (attr name - cudaLaunchAttributePriority) for each as 0, 1, 2. When graph is executed then it should honor priority i.e. third should be called followed by second and followed by first.
Another thing to note is that after setting priority or kernel graph node, If I try to confirm using cudaGraphKernelNodeGetAttribute (attr name - cudaLaunchAttributePriority) , I always get priority as 0. It should return same value set by Set call as mentioned previously … Right?
Please correct if I am wrong.

striker159 · August 21, 2023, 11:24am

Priority does not guarantee a specific ordering. If you need that guarantee, you need to add the proper dependencies between the nodes.

The allowed values for stream priority (I am assuming kernel priority uses the same range) can be queried with cudaDeviceGetStreamPriorityRange ( int* leastPriority, int* greatestPriority )
Note:

Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the the meaningful range as specified by this API, the priority is automatically clamped down or up to either *leastPriority or *greatestPriority respectively

On my machine, the least priority is 0, and the greatest priority is -5. In this case, priority of 1 or 2 would be clamped to 0 which could explain that your GetAttribute call returns 0.

jaydeeppatel_1111 · August 22, 2023, 4:43am

@striker159 - Thank you for your reply. I understand that dependencies can be use to ensure serialization however I am trying to understand what is use and behavior of cudaGraphInstantiateFlagUseNodePriority. I have created another thread for that Behavior of cudaGraphInstantiateFlagUseNodePriority - #2 by jaydeeppatel_1111

May you please have a look and share your understanding? Thanks

jaydeeppatel_1111 · August 23, 2023, 8:43am

@striker159 - I have gone through some of articles on CUDA stream priority however there are 2 priorities in cudaGraph to my understanding.
1/ Stream priority
2/ Per node priority
And as per CUDA doc for cudaGraphInstantiateFlagUseNodePriority
Run the graph using the per-node priority attributes rather than the priority of the stream it is launched into.

Does it not meant to have priority for kernel graph node on top of stream priority on which graph is launched?

Is there CUDA sample/article on cudaGraphInstantiateFlagUseNodePriority to understand it more? I could not gather much on cudaGraphInstantiateFlagUseNodePriority.

Thank you!

Topic		Replies	Views
Behavior of cudaGraphInstantiateFlagUseNodePriority CUDA Programming and Performance cuda	7	776	August 23, 2023
Multi-stream graph CUDA Programming and Performance	3	597	February 5, 2025
What will happen when I replay a cuda graph with two streams in a new stream? CUDA Programming and Performance	9	728	May 24, 2024
Multiple independent streams in a graph CUDA Programming and Performance	2	2379	October 7, 2019
Why cudaGraphLaunch(graph_exec_, stream1) dont run the graph at stream1 CUDA Programming and Performance cuda , graphics	1	71	June 6, 2025
Getting Started with CUDA Graphs Technical Blog	11	2565	January 8, 2024
Multiple launches of a single cudaGraphExec_t executing in parallel, in contrast to documentation? CUDA Programming and Performance	9	971	March 18, 2022
Cuda stream priorities Inference Town Hall 7-24-25 cuda	0	46	July 17, 2025
Rtx3070 Why when I run the program with CUDA Graph and multiple streams, there's no parallelism observed when monitoring with nsys GPU-Accelerated Libraries cuda	6	119	September 18, 2025
Questions of CUDA stream priority CUDA Programming and Performance cuda	10	4569	April 19, 2023

Using multi streams in cuda graph, the execution order is uncontrolled

Related topics