Multi-stream graph

Hello, I’m trying to create a CUDA Graph that involves multiple streams with dependencies. Specifically:

  • Stream 1 launches a sequence of kernels.
  • Other streams also execute kernels but must wait for each corresponding kernel in Stream 1 to complete before proceeding.

I can achieve this without CUDA Graphs, and also with multiple graphs per stream, but I’m wondering whether I can have just one graph per stream, in the hope of minimising overhead.

Questions:

  1. Can a single CUDA Graph contain multiple streams, with kernels in one stream waiting on another?
  2. If not, can I use multiple graphs (one per stream) and enforce synchronization between them?
  3. Why do I get "EndCapture error: capturing stream has unjoined work" in the following attempt at a minimal example?

#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel that just prints its identifier
__global__ void minimalKernel(int streamId) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx == 0) {
    printf("streamId %d\n", streamId);
  }
}

int main() {
  cudaError_t err;

  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "Pre-capture sync error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaStream_t s1, s2;
  err = cudaStreamCreate(&s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream create error s1: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamCreate(&s2);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream create error s2: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  if (err != cudaSuccess) {
    fprintf(stderr, "BeginCapture error on s1: %s\n", cudaGetErrorString(err));
    return 1;
  }

  minimalKernel<<<1, 1, 0, s1>>>(1);
  minimalKernel<<<1, 1, 0, s2>>>(2);

  cudaEvent_t evt;
  err = cudaEventCreateWithFlags(&evt, cudaEventDisableTiming);
  if (err != cudaSuccess) {
    fprintf(stderr, "Event create error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaEventRecord(evt, s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Event record error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamWaitEvent(s2, evt, 0);
  if (err != cudaSuccess) {
    fprintf(stderr, "StreamWaitEvent error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  minimalKernel<<<1, 1, 0, s2>>>(3);

  cudaGraph_t graph = nullptr;
  err = cudaStreamEndCapture(s1, &graph);
  if (err != cudaSuccess) {
    fprintf(stderr, "EndCapture error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaGraphExec_t exec;
  err = cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGraphInstantiate error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaGraphLaunch(exec, s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGraphLaunch error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  err = cudaStreamSynchronize(s1);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream sync error s1: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaStreamSynchronize(s2);
  if (err != cudaSuccess) {
    fprintf(stderr, "Stream sync error s2: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Cleanup
  cudaEventDestroy(evt);
  cudaGraphDestroy(graph);
  cudaGraphExecDestroy(exec);

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);

  fprintf(stderr, "Multi-stream capture succeeded!\n");

  return 0;
}

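Graphs and streams are two different concepts. A cudaGraph models the dependencies between work items; it does not "contain" streams. So yes, a single cudaGraph can contain nodes that have no dependency on each other and may therefore execute concurrently, as well as nodes that must wait on other nodes. How the runtime schedules that work internally (for example, on internal streams) is an implementation detail.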

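For stream capture, all work forked from the capturing stream must be joined back to that stream before the capture is ended. In your example, cudaStreamWaitEvent(s2, evt, 0) pulls s2 into the capture, but nothing joins s2 back to s1. More specifically, after minimalKernel<<<1, 1, 0, s2>>>(3); you need to record an event in stream s2 and wait on it in stream s1 before calling cudaStreamEndCapture.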
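
To illustrate the fork/join rule above, here is a minimal sketch of a capture that forks from s1 to s2 and joins back before ending. The helper name captureForkJoin, the kernel stepKernel, and the event names are made up for this illustration and are not from the original post; return values inside the capture region are ignored for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only for illustration.
__global__ void stepKernel(int id) {
  if (blockIdx.x == 0 && threadIdx.x == 0) printf("step %d\n", id);
}

// Hypothetical helper (not from the original post): captures a fork/join
// pattern across s1 and s2 into *graph. s1 is the capturing (origin) stream.
cudaError_t captureForkJoin(cudaStream_t s1, cudaStream_t s2, cudaGraph_t *graph) {
  cudaEvent_t forkEvt, joinEvt;
  cudaError_t err = cudaEventCreateWithFlags(&forkEvt, cudaEventDisableTiming);
  if (err != cudaSuccess) return err;
  err = cudaEventCreateWithFlags(&joinEvt, cudaEventDisableTiming);
  if (err != cudaSuccess) { cudaEventDestroy(forkEvt); return err; }

  err = cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  if (err == cudaSuccess) {
    stepKernel<<<1, 1, 0, s1>>>(1);

    // Fork: s2 waits on an event recorded in s1, which pulls s2 into the capture.
    cudaEventRecord(forkEvt, s1);
    cudaStreamWaitEvent(s2, forkEvt, 0);

    stepKernel<<<1, 1, 0, s2>>>(2);  // depends on step 1
    stepKernel<<<1, 1, 0, s1>>>(3);  // depends on step 1, independent of step 2

    // Join: s1 waits on an event recorded in s2, so no forked work is left unjoined.
    cudaEventRecord(joinEvt, s2);
    cudaStreamWaitEvent(s1, joinEvt, 0);

    err = cudaStreamEndCapture(s1, graph);
  }

  // The events only expressed dependencies during capture; they can be destroyed now.
  cudaEventDestroy(forkEvt);
  cudaEventDestroy(joinEvt);
  return err;
}
```

Called with two streams created by cudaStreamCreate, this should return cudaSuccess and leave a graph that can be instantiated and launched the same way as in the original program. Removing the two "join" lines should reproduce the unjoined-work error from the question.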

With explicit graph definition through the graph API (as opposed to stream capture), both the dependencies that would otherwise have been expressed across streams and the concurrency that multiple streams would have allowed can be readily represented in a single graph. That is a fundamental concept of graphs: the work that you might have expressed with multiple streams can be expressed in a single graph.
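
As a rough sketch of what that can look like, the following builds a four-node graph with cudaGraphAddKernelNode: a chain A1 -> A2 (the "Stream 1" sequence from the question) and a second chain B1 -> B2 in which each B kernel also waits for the corresponding A kernel. The names nodeKernel, addKernelNode, and buildTwoChainGraph are hypothetical, chosen only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only for illustration.
__global__ void nodeKernel(int id) {
  if (blockIdx.x == 0 && threadIdx.x == 0) printf("node %d\n", id);
}

// Hypothetical helper: adds one nodeKernel<<<1, 1>>>(id) node that depends on `deps`.
static cudaError_t addKernelNode(cudaGraph_t graph, cudaGraphNode_t *node,
                                 const cudaGraphNode_t *deps, size_t numDeps,
                                 int id) {
  void *args[] = {&id};  // argument values are copied when the node is added
  cudaKernelNodeParams p = {};
  p.func = (void *)nodeKernel;
  p.gridDim = dim3(1);
  p.blockDim = dim3(1);
  p.sharedMemBytes = 0;
  p.kernelParams = args;
  p.extra = nullptr;
  return cudaGraphAddKernelNode(node, graph, deps, numDeps, &p);
}

// Hypothetical helper: builds A1 -> A2, B1 -> B2, with B1 after A1 and B2 after A2.
cudaError_t buildTwoChainGraph(cudaGraph_t *graph) {
  cudaError_t err = cudaGraphCreate(graph, 0);
  if (err != cudaSuccess) return err;

  cudaGraphNode_t a1, a2, b1, b2;

  // A1 has no dependencies (root node).
  if ((err = addKernelNode(*graph, &a1, nullptr, 0, 1)) != cudaSuccess) return err;
  // A2 follows A1, like two kernels issued in order on one stream.
  if ((err = addKernelNode(*graph, &a2, &a1, 1, 2)) != cudaSuccess) return err;
  // B1 waits for A1; it has no dependency on A2, so the two may overlap.
  if ((err = addKernelNode(*graph, &b1, &a1, 1, 3)) != cudaSuccess) return err;
  // B2 waits for both A2 (cross-chain dependency) and B1 (in-chain order).
  cudaGraphNode_t b2Deps[] = {a2, b1};
  return addKernelNode(*graph, &b2, b2Deps, 2, 4);
}
```

The resulting graph can be instantiated and launched into a single stream, as in the original program; no second stream or event is needed because the edges carry the dependencies directly.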
