Using multiple streams in a CUDA graph, the execution order is uncontrolled

I am using the CUDA graph stream-capture API to implement a small demo with multiple streams, following the example in the CUDA Programming Guide. To my understanding, kernelB should execute on stream1, but with nsys I found that kernelB executes on a completely new stream, which is not under my control. The scheduling graph is shown below:

Here is my code:

#include <iostream>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  cudaGraphExec_t graphExec = NULL;
  cudaEvent_t event1, event2;
  cudaEventCreate(&event1);
  cudaEventCreate(&event2);

  for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // Capture the multi-stream work into a graph.
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
    kernelA<<<512, 512, 0, stream1>>>();
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernelB<<<256, 512, 0, stream1>>>();
    kernelC<<<16, 512, 0, stream2>>>();
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
    cudaStreamEndCapture(stream1, &graph);

    // If we have already instantiated the graph, try to update it in
    // place and avoid the instantiation overhead.
    if (graphExec != NULL) {
      cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }

    // Instantiate during the first iteration or whenever the update fails.
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
      if (graphExec != NULL) {
        cudaGraphExecDestroy(graphExec);
      }
      cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }

    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream1);
    cudaStreamSynchronize(stream1);
  }
  return 0;
}

I have the following questions about CUDA graphs:

  1. Perhaps CUDA graphs automatically place independent tasks on different streams, so the execution differs from what I defined in the capture code. Is there any way for me to control this?
  2. Is there some cost analysis inside CUDA graphs that assigns each task to a stream so as to maximize overall performance? Different kernel schedules can lead to different performance, especially in deep-learning inference or training.

Hi, thanks for your reply. I just want to know whether there is some cost analysis inside CUDA graphs that assigns each task to a stream so as to maximize overall performance.

CUDA graphs use streams to arrange for concurrency and asynchrony. You can control dependencies. This control is most explicit if you build the graph with the explicit graph-construction API, but if you use the stream-capture method, the dependencies are still defined at capture time.

No graph item will execute before its dependencies are complete. Other than that, CUDA graphs will attempt to schedule work efficiently to maximize performance, and you have no direct control over this scheduling.

Let’s say we have a graph item B that is dependent on A, and a graph item C that is also dependent on A. CUDA graphs will use streams (generally speaking) to allow both B and C to execute as quickly as possible, after A is complete.
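To make the dependency control concrete, here is a minimal sketch of that A/B/C pattern built with the explicit graph API instead of stream capture. The empty kernels and launch dimensions are illustrative assumptions, not the original poster's code; the point is that B and C each declare a single edge from A, and CUDA is then free to run them concurrently on streams of its own choosing:

```cuda
__global__ void A() {}
__global__ void B() {}
__global__ void C() {}

int main() {
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  // Describe each kernel launch as a graph node.
  cudaKernelNodeParams params = {};
  params.gridDim = dim3(1);
  params.blockDim = dim3(32);
  params.sharedMemBytes = 0;
  params.kernelParams = nullptr;
  params.extra = nullptr;

  cudaGraphNode_t nodeA, nodeB, nodeC;

  // A has no dependencies.
  params.func = (void *)A;
  cudaGraphAddKernelNode(&nodeA, graph, nullptr, 0, &params);

  // B and C each depend only on A; they may run concurrently
  // once A completes, on whichever streams CUDA selects.
  params.func = (void *)B;
  cudaGraphAddKernelNode(&nodeB, graph, &nodeA, 1, &params);
  params.func = (void *)C;
  cudaGraphAddKernelNode(&nodeC, graph, &nodeA, 1, &params);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, 0);
  cudaDeviceSynchronize();

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  return 0;
}
```

With this form the only thing you specify is the edges; which stream a node ends up on is an implementation detail, which matches what nsys showed in the original question.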

Regarding your question 2, you don’t have control over the detailed scheduling of activity, other than declaring dependencies.

Thanks a lot. I am wondering whether there is any available documentation on the internal multi-stream scheduling policy of CUDA graphs?

Not that I am aware of. The majority of CUDA docs are available here and the primary docs are the programming guide and the Runtime (or Driver) API reference manual. Both cover various topics related to CUDA graphs. There are also CUDA graphs blogs, such as this one and this one.

Thanks again for your kind reply.