Using multiple streams in a CUDA graph, the execution order seems uncontrolled

I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide, I wrote the complete code below. To my understanding, kernelB should execute on stream1, but with nsys I found that kernelB executes on a completely new stream, outside my control. The scheduling graph is shown below:

Here is my code:

#include <iostream>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  cudaGraphExec_t graphExec = NULL;
  cudaEvent_t event1, event2;
  cudaEventCreate(&event1);
  cudaEventCreate(&event2);

  for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;
    // Begin capture on stream1; stream2 joins the capture via the events below
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
    kernelA<<<512, 512, 0, stream1>>>();
    // Fork: stream2 waits for kernelA before running kernelC
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernelB<<<256, 512, 0, stream1>>>();
    kernelC<<<16, 512, 0, stream2>>>();
    // Join: stream1 waits for kernelC so the capture ends on a single stream
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
    cudaStreamEndCapture(stream1, &graph);
    // Update the instantiated graph in place if possible...
    if (graphExec != NULL) {
      cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }
    // ...otherwise (first iteration, or update failed) instantiate from scratch
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
      if (graphExec != NULL) {
        cudaGraphExecDestroy(graphExec);
      }
      cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }
    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
  }
}

I have the following questions about CUDA graphs:

  1. Maybe CUDA graphs automatically place independent tasks on different streams, so the execution differs from what I defined in the capture code. Is there any way for me to control this?
  2. Is there some cost analysis inside CUDA graphs that assigns tasks to streams so as to maximize overall performance? Different kernel schedules can lead to different performance, especially in deep learning inference or training.

Hi, thanks for your reply. I just want to know whether there is some cost analysis inside CUDA graphs that assigns tasks to streams so as to maximize overall performance.

CUDA graphs use streams to arrange for concurrency and asynchrony. You can control dependencies. This control is most obvious if you use the explicit graph API method, but if you use the stream capture method, the dependencies are still defined at that point.

No graph item will execute before its dependencies are complete. Other than that, CUDA graphs will attempt to schedule work efficiently to maximize performance, and you have no direct control over this scheduling.

Let’s say we have a graph item B that is dependent on A, and a graph item C that is also dependent on A. CUDA graphs will use streams (generally speaking) to allow both B and C to execute as quickly as possible, after A is complete.
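That A → {B, C} shape can also be built with the explicit graph API, which states the dependencies directly rather than deriving them from capture order. A minimal sketch (kernelA/B/C are empty kernels as in the code above; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  cudaKernelNodeParams params = {};
  params.gridDim = dim3(1);
  params.blockDim = dim3(1);
  params.kernelParams = nullptr;  // kernels take no arguments

  cudaGraphNode_t a, b, c;
  params.func = (void*)kernelA;
  cudaGraphAddKernelNode(&a, graph, nullptr, 0, &params);  // A: no dependencies

  params.func = (void*)kernelB;
  cudaGraphAddKernelNode(&b, graph, &a, 1, &params);       // B depends on A

  params.func = (void*)kernelC;
  cudaGraphAddKernelNode(&c, graph, &a, 1, &params);       // C depends on A

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, 0);  // B and C may run concurrently once A completes
  cudaStreamSynchronize(0);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
```

Which hardware queues B and C actually land on is still up to the CUDA runtime; the graph only guarantees that neither starts before A finishes.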

Regarding your question 2, you don’t have control over the detailed scheduling of activity, other than declaring dependencies.

Thanks a lot. I am wondering whether there are any available documents about the internal multi-stream scheduling policy of CUDA graphs?

Not that I am aware of. The majority of CUDA docs are available here and the primary docs are the programming guide and the Runtime (or Driver) API reference manual. Both cover various topics related to CUDA graphs. There are also CUDA graphs blogs, such as this one and this one.

Thanks again for your kind reply.

@Robert_Crovella
Hi! Thanks for your answers. Is there any way to assign different priorities (like stream priorities) to the nodes in a graph, so that thread blocks of low-priority nodes are only scheduled when no thread block from a high-priority node can be scheduled?

I actually tried cudaGraphInstantiateFlagUseNodePriority in stream capture mode, but it doesn't seem to work.

Thanks!

@Robert_Crovella - My understanding of cudaGraphInstantiateFlagUseNodePriority is that it prioritizes kernel launches.
I.e., say we have three independent kernels in a CUDA graph, first, second, and third, and each kernel waits 1 s and prints its name. If we set the kernel graph node priority using cudaGraphKernelNodeSetAttribute (attribute name cudaLaunchAttributePriority) to 0, 1, and 2 respectively, then when the graph executes it should honor the priorities, i.e. third should be called first, followed by second, followed by first.
Another thing to note: after setting the priority on a kernel graph node, if I try to confirm it using cudaGraphKernelNodeGetAttribute (attribute name cudaLaunchAttributePriority), I always get a priority of 0. It should return the same value passed to the Set call as mentioned previously ... right?
Please correct me if I am wrong.

Priority does not guarantee a specific ordering. If you need that guarantee, you need to add the proper dependencies between the nodes.
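For instance, if first and second are added (or captured) as independent kernel nodes, a single explicit edge serializes them regardless of any priority. A minimal sketch using cudaGraphAddDependencies (empty kernels; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void first()  {}
__global__ void second() {}

int main() {
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  cudaKernelNodeParams p = {};
  p.gridDim = dim3(1);
  p.blockDim = dim3(1);

  // Two nodes with no dependencies between them (so far)
  cudaGraphNode_t n1, n2;
  p.func = (void*)first;
  cudaGraphAddKernelNode(&n1, graph, nullptr, 0, &p);
  p.func = (void*)second;
  cudaGraphAddKernelNode(&n2, graph, nullptr, 0, &p);

  // One explicit edge n1 -> n2 guarantees `second` runs after `first`,
  // which priorities alone do not.
  cudaGraphAddDependencies(graph, &n1, &n2, 1);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, 0);
  cudaStreamSynchronize(0);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
```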

The allowed values for stream priority (I am assuming kernel priority uses the same range) can be queried with cudaDeviceGetStreamPriorityRange ( int* leastPriority, int* greatestPriority )
Note:

Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the meaningful range as specified by this API, the priority is automatically clamped down or up to either *leastPriority or *greatestPriority respectively.

On my machine, the least priority is 0 and the greatest priority is -5. In that case, a priority of 1 or 2 would be clamped to 0, which could explain why your GetAttribute call returns 0.
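This can be checked directly. A sketch (single empty kernel node; error checking omitted) that queries the device's range and reads back a node priority set outside it:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void k() {}

int main() {
  int least, greatest;
  cudaDeviceGetStreamPriorityRange(&least, &greatest);
  // e.g. on many GPUs: greatest = -5, least = 0 (lower number = higher priority)
  printf("meaningful range: [%d, %d]\n", greatest, least);

  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);
  cudaKernelNodeParams p = {};
  p.func = (void*)k;
  p.gridDim = dim3(1);
  p.blockDim = dim3(1);
  cudaGraphNode_t node;
  cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

  // 2 is outside [-5, 0] on such a device, so it may be clamped to 0 --
  // which would explain GetAttribute reading back 0.
  cudaLaunchAttributeValue v = {};
  v.priority = 2;
  cudaGraphKernelNodeSetAttribute(node, cudaLaunchAttributePriority, &v);
  cudaGraphKernelNodeGetAttribute(node, cudaLaunchAttributePriority, &v);
  printf("stored priority: %d\n", v.priority);

  cudaGraphDestroy(graph);
}
```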

@striker159 - Thank you for your reply. I understand that dependencies can be used to ensure serialization; however, I am trying to understand the use and behavior of cudaGraphInstantiateFlagUseNodePriority. I have created another thread for that: Behavior of cudaGraphInstantiateFlagUseNodePriority - #2 by jaydeeppatel_1111

Could you please have a look and share your understanding? Thanks

@striker159 - I have gone through some articles on CUDA stream priority; however, to my understanding there are two priorities in a CUDA graph:
1/ Stream priority
2/ Per-node priority
And per the CUDA doc for cudaGraphInstantiateFlagUseNodePriority:
Run the graph using the per-node priority attributes rather than the priority of the stream it is launched into.

Doesn't that mean the kernel graph node priority takes precedence over the priority of the stream on which the graph is launched?
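For reference, here is how I am trying to use the flag: set a node's priority attribute, then instantiate with cudaGraphInstantiateWithFlags so that attribute (rather than the launch stream's priority) is used. A minimal sketch (assumes -1 is within the device's priority range; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void k() {}

int main() {
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);
  cudaKernelNodeParams p = {};
  p.func = (void*)k;
  p.gridDim = dim3(1);
  p.blockDim = dim3(1);
  cudaGraphNode_t node;
  cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

  // Per-node priority attribute (-1 = higher priority than the default 0)
  cudaLaunchAttributeValue v = {};
  v.priority = -1;
  cudaGraphKernelNodeSetAttribute(node, cudaLaunchAttributePriority, &v);

  // Without this flag, nodes inherit the priority of the launch stream;
  // with it, each node's own priority attribute should be used.
  cudaGraphExec_t exec;
  cudaGraphInstantiateWithFlags(&exec, graph,
                                cudaGraphInstantiateFlagUseNodePriority);
  cudaGraphLaunch(exec, 0);
  cudaStreamSynchronize(0);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
```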

Is there a CUDA sample or article on cudaGraphInstantiateFlagUseNodePriority to help understand it better? I could not find much on it.

Thank you!