CUDA Graph multi-GPU performance

Hello,

I am writing a multi-GPU accelerated simulation that consists of simple kernels computing basic per-element operations and FFTs implemented with the multi-GPU extension of the cuFFT library (cufftXt). To achieve better performance, I decided to rewrite the application using the CUDA Graph API. The main goal was to reduce the overhead caused by context switching on large GPU systems such as the DGX-2 and DGX A100. However, the results are not as good as expected for larger simulations.

The original application was written in a “traditional” way:

for (std::size_t step = 0ul; step < nSteps; ++step)
{
  // launch the per-element kernels, one device at a time
  for (int i = 0; i < nDevices; ++i)
  {
    cudaSetDevice(i);
    cudaLaunchKernel(/* params */);
  }
  ...
  // multi-GPU FFT
  cufftXtExecDescriptor(/* params */);
  ...
  /* some other computations */
}

// cudaDeviceSynchronize() only synchronizes the current device,
// so each device is synchronized in turn
for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  cudaDeviceSynchronize();
}

The idea was to use the CUDA Graph API through stream capture, which is supported by the cuFFT library as well. Here is the equivalent code using the CUDA Graph API:

cudaGraph_t     graph;
cudaGraphExec_t execGraph;

cudaStream_t mainStream;
cudaEvent_t  mainStreamForkEvent;
cudaEvent_t  mainStreamPostFftSyncEvent;

cudaStream_t privateStream[nDevices];
cudaEvent_t  privateStreamJoinEvent[nDevices];
cudaEvent_t  privateStreamPreFftSyncEvent[nDevices];
---------------------------------------------------------------------------------
/* initialization of streams & events */
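// (roughly, for completeness -- the exact code is omitted; the events are
// created with cudaEventDisableTiming since they are used purely for
// synchronization)
cudaSetDevice(0);                      // mainStream lives in context 0
cudaStreamCreate(&mainStream);
cudaEventCreateWithFlags(&mainStreamForkEvent,        cudaEventDisableTiming);
cudaEventCreateWithFlags(&mainStreamPostFftSyncEvent, cudaEventDisableTiming);

for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  cudaStreamCreate(&privateStream[i]);
  cudaEventCreateWithFlags(&privateStreamPreFftSyncEvent[i], cudaEventDisableTiming);
  cudaEventCreateWithFlags(&privateStreamJoinEvent[i],       cudaEventDisableTiming);
}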

// set fft stream (plan is the cufftHandle used by cufftXtExecDescriptor)
cufftSetStream(plan, mainStream);
---------------------------------------------------------------------------------
cudaStreamBeginCapture(mainStream, cudaStreamCaptureModeGlobal);

// put fork event in main stream
cudaEventRecord(mainStreamForkEvent, mainStream);

for (int i = 0; i < nDevices; ++i)
{
  cudaSetDevice(i);
  
  // make private streams wait for fork event
  cudaStreamWaitEvent(privateStream[i], mainStreamForkEvent);

  // launch kernel in private streams
  cudaLaunchKernel(/* params */, privateStream[i]);

  // put pre-fft synchronization event in private streams
  cudaEventRecord(privateStreamPreFftSyncEvent[i], privateStream[i]);
  // make main stream wait for work in private stream to finish
  cudaStreamWaitEvent(mainStream, privateStreamPreFftSyncEvent[i]);
}

// execute fft
cufftXtExecDescriptor(/* params */);

// record post-fft synchronization event in main stream
cudaEventRecord(mainStreamPostFftSyncEvent, mainStream);
for (int i = 0; i < nDevices; ++i)
{
  // make private streams wait for fft to be done before launching other jobs
  cudaStreamWaitEvent(privateStream[i], mainStreamPostFftSyncEvent);
}

/* other computations in private streams */

for (int i = 0; i < nDevices; ++i)
{
  // put join events in private streams
  cudaEventRecord(privateStreamJoinEvent[i], privateStream[i]);
  // make main stream wait for jobs in private streams to finish
  cudaStreamWaitEvent(mainStream, privateStreamJoinEvent[i]);
}

cudaStreamEndCapture(mainStream, &graph);
---------------------------------------------------------------------------------
cudaGraphInstantiate(&execGraph, graph, nullptr, nullptr, 0);

for (std::size_t step = 0ul; step < nSteps; ++step)
{
  cudaGraphLaunch(execGraph, mainStream);
}
cudaStreamSynchronize(mainStream);
---------------------------------------------------------------------------------
/* deinitialization */

The mainStream stream (created in context 0) is used both as the capture stream and as the stream in which the FFT is launched. Jobs on each GPU are launched in their own private streams privateStream[]. There are also several events needed for the initial fork and the final join of the streams, plus events for pre-FFT and post-FFT synchronization, since cuFFT allows launching in only a single stream. The final graph is then launched asynchronously in mainStream.

The solution works quite well for small simulations; for larger simulations, however, an overhead appears, and on some architectures the CUDA Graph implementation is in fact slower than the original one. Here is a measurement of a 3D simulation using 4x V100 GPUs:

Size          128     256     384     512     640     768      896
Original   0.0618  0.3312  1.1079  2.4134  4.8074  7.6897  12.1971
CUDA Graph 0.0445  0.3464  1.1077  2.3661  4.9318  7.8198  12.3115

The results look much better for the same simulation using 4x A100 GPUs:

Size          128     256     384     512     640     768      896
Original   0.0612  0.1679  0.4726  0.9756  1.9140  3.3814   5.7710
CUDA Graph 0.0437  0.1403  0.4397  0.9297  1.8636  3.3172   5.6669

So my question is: how well is the CUDA Graph API optimized for multi-GPU usage? Is there a better approach? Should I, for example, capture multiple iterations into a single graph?
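By capturing multiple iterations I mean roughly the following sketch, where nStepsPerGraph would be a new tuning parameter (assuming nSteps is divisible by it) and the capture body is the same fork/kernels/FFT/join sequence as above:

cudaStreamBeginCapture(mainStream, cudaStreamCaptureModeGlobal);
for (std::size_t s = 0ul; s < nStepsPerGraph; ++s)
{
  /* same per-step body as in the capture above */
}
cudaStreamEndCapture(mainStream, &graph);
cudaGraphInstantiate(&execGraph, graph, nullptr, nullptr, 0);

// each launch now replays nStepsPerGraph steps
for (std::size_t step = 0ul; step < nSteps / nStepsPerGraph; ++step)
{
  cudaGraphLaunch(execGraph, mainStream);
}
cudaStreamSynchronize(mainStream);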

Also, it would be really nice to get rid of the pre- and post-FFT synchronization; cuFFT creates its own private streams internally anyway…

Thank you very much.

David

It looks like you are always getting a speedup in the Ampere case.
Ampere added more hardware features that graphs use to re-launch work efficiently.

It seems that in many cases graphs are giving you a good speedup. The cases where there is more overhead than speedup seem to take only a small hit (<3%).

I guess the question is: do you have enough Volta machines running enough of the workloads that suffer by 3% to make it worth maintaining the streams-only path and figuring out where the break-even points are?

WRT “how well is the CUDA Graph API optimized for multi-GPU usage?”: only the dependencies on a given device are really optimized. Multi-GPU support in graphs lets the multi-GPU work be expressed efficiently as a single graph, but it does not speed up cross-GPU synchronization today.

My recommendation would be to make graphs the default. If you want to maintain a streams version and opt into it for those large models on pre-Ampere machines, go for it.

If you want to optimize your simulation for multiple GPUs, I would look at separating the problem into local and global concerns, so that you can run multiple local timestep operations without having to communicate across GPUs.
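Purely as an illustration of that split (localStepGraph, globalExchangeGraph, and nLocalSteps are hypothetical names, assuming the device-local updates can be captured into a separate graph from the global FFT/exchange phase):

for (std::size_t step = 0ul; step < nSteps; step += nLocalSteps)
{
  // replay cheap device-local steps; no cross-GPU events inside this graph
  for (std::size_t s = 0ul; s < nLocalSteps; ++s)
  {
    cudaGraphLaunch(localStepGraph, mainStream);
  }
  // one global phase containing the FFT and the cross-GPU synchronization
  cudaGraphLaunch(globalExchangeGraph, mainStream);
}
cudaStreamSynchronize(mainStream);

The fewer global phases you need per simulated timestep, the less the cross-GPU synchronization cost matters.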

You might be interested in this article about how GROMACS optimized their simulation with graphs:
A Guide to CUDA Graphs in GROMACS 2023 | NVIDIA Technical Blog