CUDA Graph multi-GPU performance

Hello,

I am writing a multi-GPU accelerated simulation that consists of simple kernels computing basic per-element operations and FFTs implemented with the multi-GPU extension of the cuFFT library (cufftXt). To achieve better performance, I decided to rewrite the application using the CUDA Graph API. The main goal was to reduce the overhead caused by context switching on large GPU systems such as the DGX-2 and DGX A100. However, the results are not as good as expected for larger simulations.

The original application was written in a “traditional” way:

for (std::size_t step = 0ul; step < nSteps; ++step)
{
  // launch the per-element kernels, one device at a time
  for (int i = 0; i < nDevices; ++i)
  {
    cudaSetDevice(i);
    cudaLaunchKernel(/* params */);
  }
  ...
  // multi-GPU FFT
  cufftXtExecDescriptor(/* params */);
  ...
  /* some other computations */
}

// cudaDeviceSynchronize() only synchronizes the current device,
// so each device is synchronized in turn
for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  cudaDeviceSynchronize();
}

The idea was to use the CUDA Graph API through stream capture, which is supported by the cuFFT library as well. Here is the equivalent code using the CUDA Graph API:

cudaGraph_t     graph;
cudaGraphExec_t execGraph;

cudaStream_t mainStream;
cudaEvent_t  mainStreamForkEvent;
cudaEvent_t  mainStreamPostFftSyncEvent;

cudaStream_t privateStream[nDevices];
cudaEvent_t  privateStreamJoinEvent[nDevices];
cudaEvent_t  privateStreamPreFftSyncEvent[nDevices];
---------------------------------------------------------------------------------
/* initialization of streams & events */
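// (roughly, for completeness -- the exact code is omitted; the events are
// created with cudaEventDisableTiming since they are used purely for
// synchronization)
cudaSetDevice(0);                      // mainStream lives in context 0
cudaStreamCreate(&mainStream);
cudaEventCreateWithFlags(&mainStreamForkEvent,        cudaEventDisableTiming);
cudaEventCreateWithFlags(&mainStreamPostFftSyncEvent, cudaEventDisableTiming);

for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  cudaStreamCreate(&privateStream[i]);
  cudaEventCreateWithFlags(&privateStreamPreFftSyncEvent[i], cudaEventDisableTiming);
  cudaEventCreateWithFlags(&privateStreamJoinEvent[i],       cudaEventDisableTiming);
}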

// set fft stream (plan is the cufftHandle used by cufftXtExecDescriptor)
cufftSetStream(plan, mainStream);
---------------------------------------------------------------------------------
cudaStreamBeginCapture(mainStream, cudaStreamCaptureModeGlobal);

// put fork event in main stream
cudaEventRecord(mainStreamForkEvent, mainStream);

for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  
  // make private streams wait for fork event
  cudaStreamWaitEvent(privateStream[i], mainStreamForkEvent);

  // launch kernel in private streams
  cudaLaunchKernel(/* params */, privateStream[i]);

  // put pre-fft synchronization event in private streams
  cudaEventRecord(privateStreamPreFftSyncEvent[i], privateStream[i]);
  // make main stream wait for work in private stream to finish
  cudaStreamWaitEvent(mainStream, privateStreamPreFftSyncEvent[i]);
}

// execute fft
cufftXtExecDescriptor(/* params */);

// record post-fft synchronization event in main stream
cudaEventRecord(mainStreamPostFftSyncEvent, mainStream);
for (int i = 0; i < nDevices; ++i)
{
  // make private streams wait for fft to be done before launching other jobs
  cudaStreamWaitEvent(privateStream[i], mainStreamPostFftSyncEvent);
}

/* other computations in private streams */

for (int i = 0; i < nDevices; ++i)
{
  // put join events in private streams
  cudaEventRecord(privateStreamJoinEvent[i], privateStream[i]);
  // make main stream wait for jobs in private streams to finish
  cudaStreamWaitEvent(mainStream, privateStreamJoinEvent[i]);
}

cudaStreamEndCapture(mainStream, &graph);
---------------------------------------------------------------------------------
cudaGraphInstantiate(&execGraph, graph, nullptr, nullptr, 0);

for (std::size_t step = 0ul; step < nSteps; ++step)
{
  cudaGraphLaunch(execGraph, mainStream);
}
cudaStreamSynchronize(mainStream);
---------------------------------------------------------------------------------
/* deinitialization */

The mainStream stream (created in context 0) is used both as the capture stream and as the stream in which the FFT is launched. Jobs on each GPU are launched in their own private streams privateStream[]. There are also several events needed for the initial fork and the final join of the streams, plus events for pre-FFT and post-FFT synchronization, since cuFFT allows launching in only a single stream. The final graph is then launched asynchronously in mainStream.

The solution works quite well for small simulations; for larger simulations, however, an overhead appears, and on some architectures the CUDA Graph implementation is in fact slower than the original one. Here is a measurement of a 3D simulation using 4x V100 GPUs:

Size          128     256     384     512     640     768      896
Original   0.0618  0.3312  1.1079  2.4134  4.8074  7.6897  12.1971
CUDA Graph 0.0445  0.3464  1.1077  2.3661  4.9318  7.8198  12.3115

The results look much better for the same simulation using 4x A100 GPUs:

Size          128     256     384     512     640     768      896
Original   0.0612  0.1679  0.4726  0.9756  1.9140  3.3814   5.7710
CUDA Graph 0.0437  0.1403  0.4397  0.9297  1.8636  3.3172   5.6669

So my question is: how well is the CUDA Graph API optimized for multi-GPU usage? Is there a better approach? Should I, for example, capture multiple iterations into a single graph?
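By capturing multiple iterations I mean roughly the following sketch, where nStepsPerGraph would be a new tuning parameter (assuming nSteps is divisible by it) and the capture body is the same fork/kernels/FFT/join sequence as above:

cudaStreamBeginCapture(mainStream, cudaStreamCaptureModeGlobal);
for (std::size_t s = 0ul; s < nStepsPerGraph; ++s)
{
  /* same per-step body as in the capture above */
}
cudaStreamEndCapture(mainStream, &graph);
cudaGraphInstantiate(&execGraph, graph, nullptr, nullptr, 0);

// each launch now replays nStepsPerGraph steps
for (std::size_t step = 0ul; step < nSteps / nStepsPerGraph; ++step)
{
  cudaGraphLaunch(execGraph, mainStream);
}
cudaStreamSynchronize(mainStream);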

Also, it would be really nice to get rid of the pre- and post-FFT synchronization; cuFFT creates its own private streams internally anyway…

Thank you very much.

David

It looks like you are always getting a speedup in the Ampere case.
Ampere added more hardware features that graphs use to re-launch work efficiently.

It seems that in many cases graphs are giving you a good speedup. The cases where there is more overhead than speedup seem to take only a small hit (<3%).

I guess the question is: do you have enough Volta machines running enough of the workloads that suffer by 3% to make it worth maintaining the streams-only path and figuring out where the break-even points are?

WRT “how well is the CUDA Graph API optimized for multi-GPU usage?”: only the dependencies on a given device are really optimized. Multi-GPU support in graphs lets the multi-GPU work be expressed efficiently as a single graph, but it does not speed up cross-GPU synchronization today.

My recommendation would be to make graphs the default. If you want to maintain a streams version and opt into it for those large models on pre-Ampere machines, go for it.

If you want to optimize your simulation for multiple GPUs, I would look at separating the problem into local and global concerns, so that you can run multiple local timestep operations without having to communicate across GPUs.
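Purely as an illustration of that split (localStepGraph, globalExchangeGraph, and nLocalSteps are hypothetical names, assuming the device-local updates can be captured into a separate graph from the global FFT/exchange phase):

for (std::size_t step = 0ul; step < nSteps; step += nLocalSteps)
{
  // replay cheap device-local steps; no cross-GPU events inside this graph
  for (std::size_t s = 0ul; s < nLocalSteps; ++s)
  {
    cudaGraphLaunch(localStepGraph, mainStream);
  }
  // one global phase containing the FFT and the cross-GPU synchronization
  cudaGraphLaunch(globalExchangeGraph, mainStream);
}
cudaStreamSynchronize(mainStream);

The fewer global phases you need per simulated timestep, the less the cross-GPU synchronization cost matters.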

You might be interested in this article about how GROMACS optimized their simulation with graphs:
A Guide to CUDA Graphs in GROMACS 2023 | NVIDIA Technical Blog