Hello,
I am writting a multi-GPU accelerated simulation that consists of simple kernels computing basic per element operations and FFT implemented by extended cuFFT library. To achieve a better performance, I have decided to rewrite the application using CUDA Graph API. The main goal was to improve overhead due to context switching on large GPU systems like DGX-2 and DGX A100. However the results are not as good as expected for larger simulations.
The original application was written in a “traditional” way:
for (std::size_t step = 0ul; step < nSteps; ++step)
{
for (int i = 0; i < nDevices; ++i)
{
cudaSetDevice(i);
cudaLaunchKernel(/* params */);
}
...
cufftXtExecDescriptor(/* params */);
...
/* some other computations */
}
cudaDeviceSynchronize();
The idea was to use CUDA Graph API through stream capture which is supported by cuFFT library as well. Here is an equivalent code using CUDA Graph API:
cudaGraph_t graph;
cudaExecGraph_t execGraph;
cudaStream_t mainStream;
cudaEvent_t mainStreamForkEvent;
cudaEvent_t mainStreamPostFftSyncEvent;
cudaStream_t privateStream[nDevices];
cudaEvent_t privateStreamJoinEvent[nDevices];
cudaEvent_t privateStreamPreFftSyncEvent[nDevices];
---------------------------------------------------------------------------------
/* initialization of streams & events */
// set fft stream
cufftSetStream(mainStream);
---------------------------------------------------------------------------------
cudaStreamBeginCapture(mainStream);
// put fork event in main stream
cudaEventRecord(mainStreamForkEvent, mainStream);
for (int i = 0; i < nDevices; ++i)
{
cudaSetDevice(i);
// make private streams wait for fork event
cudaStreamWaitEvent(privateStream[i], mainStreamForkEvent);
// launch kernel in private streams
cudaLaunchKernel(/* params */, privateStream[i]);
// put pre-fft synchronization event in private streams
cudaEventRecord(privateStreamPreFftSyncEvent[i], privateStream[i]);
// make main stream wait for work in private stream to finish
cudaStreamWaitEvent(masterStream, privateStreamPreFftSyncEvent[i]);
}
// execute fft
cufftXtExecDescriptor(/* params */);
// record post-fft synchronization event in master stream
cudaEventRecord(mainStreamPostFftSyncEvent, mainStream);
for (int i = 0; i < nDevices; ++i)
{
// make private streams wait for fft to be done before launching other jobs
cudaStreamWaitEvent(privateStream[i], mainStreamPostFftSyncEvent);
}
/* other computations in private streams */
for (int i = 0; i < nDevices; ++i)
{
// put join events in private streams
cudaEventRecord(privateStreamJoinEvent[i], privateStream[i]);
// make main stream wait for jobs in private streams to finish
cudaStreamWaitEvent(mainStream, privateStreamJoinEvent[i]);
}
cudaStreamEndCapture(mainStream, &graph);
---------------------------------------------------------------------------------
cudaGraphInstantiate(&execGraph, graph);
for (std::size_t step = 0ul; step < nSteps; ++step)
{
cudaGraphLaunch(execGraph, mainStream)
}
cudaStreamSynchronize(mainStream);
---------------------------------------------------------------------------------
/* deinitialization */
There is a mainStream stream which is used in as the captured stream (created in context 0) and as the stream for FFT launch. Jobs on each GPU are launched in their private streams privateStream[]. There are also several events necessary for initial fork and final join of the streams and events for pre-fft and post-fft synchronization as cuFFT enables launching only in one stream. The final graph is then asynchronously launched in the mainStream.
The solution works quite well on small simulations, however in larger simulations, an overhead occures, in fact on some architectures CUDA Graph implementation is slower than the original one. Here is a measurement on 3D simulation using 4x V100 GPUs:
| 128 | 256 | 384 | 512 | 640 | 768 | 896 | |
|---|---|---|---|---|---|---|---|
| Original | 0.0618 | 0.3312 | 1.1079 | 2.4134 | 4.8074 | 7.6897 | 12.1971 |
| CUDA Graph | 0.0445 | 0.3464 | 1.1077 | 2.3661 | 4.9318 | 7.8198 | 12.3115 |
The results look much better on the same simulation using 4x A100 GPUs:
| 128 | 256 | 384 | 512 | 640 | 768 | 896 | |
|---|---|---|---|---|---|---|---|
| Original | 0.0612 | 0.1679 | 0.4726 | 0.9756 | 1.9140 | 3.3814 | 5.7710 |
| CUDA Graph | 0.0437 | 0.1403 | 0.4397 | 0.9297 | 1.8636 | 3.3172 | 5.6669 |
So my question is: how well is CUDA Graph API optimized for multi-GPU usage? Is there a better approach? Should I for example capture multiple iterations?
Also it would be really nice to get rid of the pre and post fft synchronization, cuFFT creates other private streams anyway…
Thank you very much.
David