Multi-stream graph

joseff · February 5, 2025, 5:28pm

Hello, I’m trying to create a CUDA Graph that involves multiple streams with dependencies. Specifically:

Stream 1 launches a sequence of kernels.
Other streams also execute kernels but must wait for each corresponding kernel in Stream 1 to complete before proceeding.
I can achieve this without CUDA Graphs, and with multiple graphs per stream, but I’m wondering if I can have just one graph per stream, in hopes of minimising overhead.

Questions:

Can a single CUDA Graph contain multiple streams, with kernels in one stream waiting on another?
If not, can I use multiple graphs (one per stream) and enforce synchronization between them?
Why do I get "EndCapture error: capturing stream has unjoined work" in my following attempt at a minimal example?

#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel that just prints its identifier
__global__ void minimalKernel(int streamId) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx == 0) {
    printf("streamId %d\n", streamId);
  }
}

int main() {
  cudaError_t err;

  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "Pre-capture sync error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaStream_t s1, s2;
  err = cudaStreamCreate(&s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream create error s1: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamCreate(&s2);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream create error s2: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  if (err != cudaSuccess) {
    fprintf(stderr, "BeginCapture error on s1: %s\n", cudaGetErrorString(err));
    return 1;
  }

  minimalKernel<<<1, 1, 0, s1>>>(1);
  minimalKernel<<<1, 1, 0, s2>>>(2);

  cudaEvent_t evt;
  err = cudaEventCreateWithFlags(&evt, cudaEventDisableTiming);
  if (err != cudaSuccess) {
    fprintf(stderr, "Event create error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaEventRecord(evt, s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Event record error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamWaitEvent(s2, evt, 0);
  if (err != cudaSuccess) {
    fprintf(stderr, "StreamWaitEvent error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  minimalKernel<<<1, 1, 0, s2>>>(3);

  cudaGraph_t graph = nullptr;
  err = cudaStreamEndCapture(s1, &graph);
  if (err != cudaSuccess) {
    fprintf(stderr, "EndCapture error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaGraphExec_t exec;
  err = cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGraphInstantiate error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaGraphLaunch(exec, s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGraphLaunch error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaStreamSynchronize(s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream sync error s1: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamSynchronize(s2);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream sync error s2: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Cleanup
  cudaEventDestroy(evt);
  cudaGraphDestroy(graph);
  cudaGraphExecDestroy(exec);

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);

  fprintf(stderr, "Multi-stream capture succeeded!\n");

  return 0;
}

striker159 · February 5, 2025, 6:04pm

Graphs and streams are two different concepts. A cudaGraph models the dependencies between work items. Yes, a cudaGraph can contain nodes which could be executed simultaneously. The workflow in the graph could then be implemented internally using streams.

For stream capture, all work forked from the recording stream must be joined back to that stream before ending the stream capture. More specifically, after minimalKernel<<<1, 1, 0, s2>>>(3); you need to record an event in stream s2 and wait on it in stream s1.
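The fork/join pattern described above can be sketched as follows. This is a minimal illustration rather than a definitive fix: the `CHECK` macro and the event names `forkEvt` and `joinEvt` are hypothetical, and `cudaGraphInstantiateWithFlags` is used here on the assumption of CUDA 11.4 or newer. Note that in the original example, kernel 2 is launched on s2 before s2 has waited on any event from s1, so it is presumably not part of the capture at all; in this sketch every launch on s2 comes after the fork.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the failing call and exit main().
#define CHECK(call)                                                        \
  do {                                                                     \
    cudaError_t e_ = (call);                                               \
    if (e_ != cudaSuccess) {                                               \
      fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_));   \
      return 1;                                                            \
    }                                                                      \
  } while (0)

__global__ void minimalKernel(int id) {
  if (blockIdx.x * blockDim.x + threadIdx.x == 0) printf("kernel %d\n", id);
}

int main() {
  cudaStream_t s1, s2;
  cudaEvent_t forkEvt, joinEvt;
  CHECK(cudaStreamCreate(&s1));
  CHECK(cudaStreamCreate(&s2));
  CHECK(cudaEventCreateWithFlags(&forkEvt, cudaEventDisableTiming));
  CHECK(cudaEventCreateWithFlags(&joinEvt, cudaEventDisableTiming));

  // s1 is the origin stream of the capture.
  CHECK(cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal));
  minimalKernel<<<1, 1, 0, s1>>>(1);

  // Fork: s2 enters the capture by waiting on an event recorded in s1.
  CHECK(cudaEventRecord(forkEvt, s1));
  CHECK(cudaStreamWaitEvent(s2, forkEvt, 0));
  minimalKernel<<<1, 1, 0, s2>>>(2);  // depends on kernel 1
  minimalKernel<<<1, 1, 0, s1>>>(3);  // depends on kernel 1, independent of kernel 2

  // Join: s1 waits on an event recorded in s2 before the capture ends.
  CHECK(cudaEventRecord(joinEvt, s2));
  CHECK(cudaStreamWaitEvent(s1, joinEvt, 0));

  cudaGraph_t graph = nullptr;
  CHECK(cudaStreamEndCapture(s1, &graph));

  cudaGraphExec_t exec;
  CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

  // The whole graph is launched into one stream; s2 is not needed at launch time.
  CHECK(cudaGraphLaunch(exec, s1));
  CHECK(cudaStreamSynchronize(s1));

  CHECK(cudaGraphExecDestroy(exec));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaEventDestroy(forkEvt));
  CHECK(cudaEventDestroy(joinEvt));
  CHECK(cudaStreamDestroy(s1));
  CHECK(cudaStreamDestroy(s2));
  return 0;
}
```

Kernel 1 should print first; kernels 2 and 3 have no dependency on each other, so their relative order is not guaranteed.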

Robert_Crovella · February 5, 2025, 6:14pm

When a graph is defined explicitly through the graph API (rather than by stream capture), both the dependencies that would otherwise cross streams and the concurrency that multiple streams would allow can be readily represented in a single graph. That is a fundamental concept of graphs: the work that you might have expressed with multiple streams can be expressed in a single graph.
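As a rough sketch of what explicit graph definition might look like for the same shape of work (one kernel followed by two independent kernels), the example below builds the graph with `cudaGraphAddKernelNode` and no capture streams at all. The `CHECK` macro and the `makeParams` helper are hypothetical names introduced for illustration, and `cudaGraphInstantiateWithFlags` again assumes CUDA 11.4 or newer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the failing call and exit main().
#define CHECK(call)                                                        \
  do {                                                                     \
    cudaError_t e_ = (call);                                               \
    if (e_ != cudaSuccess) {                                               \
      fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_));   \
      return 1;                                                            \
    }                                                                      \
  } while (0)

__global__ void minimalKernel(int id) {
  if (blockIdx.x * blockDim.x + threadIdx.x == 0) printf("kernel %d\n", id);
}

// Hypothetical helper: kernel node parameters for a <<<1, 1>>> launch.
static cudaKernelNodeParams makeParams(void** args) {
  cudaKernelNodeParams p = {};
  p.func = (void*)minimalKernel;
  p.gridDim = dim3(1);
  p.blockDim = dim3(1);
  p.sharedMemBytes = 0;
  p.kernelParams = args;
  p.extra = nullptr;
  return p;
}

int main() {
  cudaGraph_t graph;
  CHECK(cudaGraphCreate(&graph, 0));

  int idA = 1, idB = 2, idC = 3;
  void* argsA[] = {&idA};
  void* argsB[] = {&idB};
  void* argsC[] = {&idC};
  cudaKernelNodeParams pA = makeParams(argsA);
  cudaKernelNodeParams pB = makeParams(argsB);
  cudaKernelNodeParams pC = makeParams(argsC);

  // Node a has no dependencies; b and c each depend only on a,
  // so the runtime is free to execute b and c concurrently.
  cudaGraphNode_t a, b, c;
  CHECK(cudaGraphAddKernelNode(&a, graph, nullptr, 0, &pA));
  CHECK(cudaGraphAddKernelNode(&b, graph, &a, 1, &pB));
  CHECK(cudaGraphAddKernelNode(&c, graph, &a, 1, &pC));

  cudaGraphExec_t exec;
  CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));
  CHECK(cudaGraphLaunch(exec, stream));
  CHECK(cudaStreamSynchronize(stream));

  CHECK(cudaGraphExecDestroy(exec));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  return 0;
}
```

The dependency arrays passed to `cudaGraphAddKernelNode` play the role that event record/wait pairs play in the stream-based version; no stream appears until the single launch stream at the end.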
