Long overhead with cuStreamSynchronize with OMPI

Hi guys,
I took a trace of a piece of a CUDA-aware MPI application, and it seems that cuStreamSynchronize has a 10x overhead. What is odd is that the memcpy finished long before the synchronize returned (please see the attached snapshot). What can the synchronization (driver) be doing that keeps it from returning after the transfer is complete?
Thanks, Noob-Noob

Poke. Any response?

I am sorry that there was no response to this earlier; your forum post was dropped into an orphaned category that the Nsys team was unaware of until this afternoon.

Can you give us the .qdrep in question so that we can take a closer look at it?

Hi, sorry, I was only provided the screenshot. Is there any high-level rationale for what could be causing this overhead, or anything that can be done to optimize it?

You might take a look at Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems | NVIDIA Developer Blog.

Thank you for the document. Question: if I have a multi-process (MPI-based) program and all the processes have a stream 26, will submissions from the different processes be serialized within the driver (and/or hardware)?

Could you please give us more information on the application run and the CUDA toolkit, CUDA driver, and OpenMPI versions used?

How many MPI processes are running? Is it possible that they use the same GPU? Either way, it is strange that cuStreamSynchronize takes so much time after the memcpy, but it is tough to guess what the driver is waiting for from just the screenshot. The context's scheduling strategy could also be a reason for the long cuStreamSynchronize (see the cuCtxCreate() docs).
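For context, the scheduling flag controls how the host thread waits inside cuStreamSynchronize (spin, yield, or block on an OS primitive). Below is a minimal runtime-API sketch for selecting blocking synchronization; it assumes your application creates the primary context itself, which may not be how GROMACS/Open MPI set things up (the driver-API equivalent is the CU_CTX_SCHED_* flag passed to cuCtxCreate):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Illustrative sketch only. The flag must be set before the primary context
    // is created, i.e. before the first CUDA call that touches the device.
    // cudaDeviceScheduleBlockingSync: the host thread sleeps on a sync primitive
    // while waiting, instead of spin-waiting (cudaDeviceScheduleSpin) or
    // yielding (cudaDeviceScheduleYield).
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
    }

    // ... rest of the application: MPI init, allocations, async memcpys,
    //     stream synchronization, etc. ...
    return 0;
}

With spin-waiting, a synchronize can also burn a CPU core while it waits, which matters if several ranks share cores.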

CUDA 11.2, driver 460.32.03, Open MPI 5.1.0a1, 8 MPI processes.

I have a feeling that OMPI is just using one stream (handle) across multiple GPUs and the driver might be getting confused, but I’m not sure. It’s strange to me that all the processes have a stream 26. I’m guessing that might be OK since the default stream also has the same handle (7). I have confirmed that cuStreamSynchronize returns a little after the last transfer completes on that stream across all the ranks (not sure if that helps).

Each MPI process will create its own CUDA context, each with its own CUDA streams. If the MPI processes are all doing the same CUDA operations, they will very likely also have the same CUDA stream IDs. If the MPI processes use different GPUs, there is no serialization between operations on different GPUs. If you use MPS, there is a good chance that CUDA kernels or memory copies are performed concurrently even on the same GPU. Without MPS, CUDA kernels/memcopies that are submitted from different MPI processes to the same GPU are serialized.
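As a minimal sketch of what I mean (illustrative only, not the Open MPI or GROMACS code): every rank runs the same sequence of CUDA calls, so the stream handles it gets will usually line up across ranks, even though each process owns its own context and its own streams.

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // Each process gets its own primary context on the device it selects here;
    // ranks may share a GPU if there are more ranks than devices.
    int dev = (ndev > 0) ? rank % ndev : 0;
    cudaSetDevice(dev);

    // Each process creates its own stream inside its own context. Because every
    // rank executes the same call sequence, the profiler's stream IDs (e.g.
    // "Stream 26") will typically coincide across ranks, but the streams are
    // still distinct objects in distinct contexts.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    printf("rank %d: device %d, stream handle %p\n", rank, dev, (void*)stream);

    cudaStreamDestroy(stream);
    MPI_Finalize();
    return 0;
}

So “stream 26” in rank 0’s timeline and “stream 26” in rank 1’s timeline are different streams; whether work on them serializes depends on whether the ranks share a GPU and whether MPS is in use, as described above.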

That would be my intuition as well. The transfers happen in parallel, but the synchronization doesn’t return until all of those transfers across all the processes are complete. Also, this doesn’t seem to happen over NVLink (using P2P), only for PCIe host HtoD/DtoH copies.

That OpenMPI version is more than up to date. Is MPS enabled on your system? Can you check that all MPI processes use different CPU cores? (In your screenshot the CPU core for MPI rank 0 is colored in red, i.e. the bar/row between the black one and the green one.)
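If it is easier than reading it off the timeline, a quick way to check the binding from inside the application is to print the CPU core per rank. Here is a minimal sketch (Linux-only, using sched_getcpu(); purely illustrative, not meant as the GROMACS code):

#include <mpi.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Print the core each rank is currently running on. If several ranks report
    // the same core, they compete for it, and a spin-waiting cuStreamSynchronize
    // can appear much longer than the underlying transfer.
    printf("rank %d is running on CPU core %d\n", rank, sched_getcpu());

    MPI_Finalize();
    return 0;
}

Binding is usually also controllable from the launcher, e.g. with Open MPI’s --bind-to core option.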

To get more information, you could enable sampling with DWARF backtraces (nsys profile -s cpu -b dwarf). If you want, you can also share the report file (.qdrep) or your code so that we can try to reproduce the issue.

Another helpful data point for us would be to know if your cuStreamSynchronize calls are faster when Nsight Systems is not being used, or if they take this long even without tools. Would you mind adding some quick timing code around your MPI_Send calls? Those seem like they are being made much slower by long cuStreamSynchronize calls, so if you see MPI_Send get much faster when running without Nsight Systems, we’ll know it’s an issue with the tool.

An easy way to get timestamps with good-enough precision is to use C++11’s chrono library. You can add some helper code like this at the top of your .cpp file:

#include <chrono>
#include <cstdio>   // for the printf used below

using std::chrono::duration_cast;
using msec = std::chrono::milliseconds;
using usec = std::chrono::microseconds;
using nsec = std::chrono::nanoseconds;

auto& now = std::chrono::high_resolution_clock::now;

…and then to time an MPI_Send call, get timestamps before and after it, and look at the difference. You can convert the difference to whatever units are convenient, like this:

auto before = now();
int result = MPI_Send(...);
auto after = now();

int elapsedUsec = (int)duration_cast<usec>(after - before).count();

printf("MPI_Send duration: %d usec\n", elapsedUsec);

Then you can run your app with and without Nsight Systems, and see if the durations of MPI_Send calls change significantly. Knowing that would really help us!

Hi guys, the performance difference with and without nsys is within 4%. The application is the master branch of GROMACS, if that helps. The initial screenshot is a trace from that application.

Poke. Any update on this? I’m still trying to figure out what is happening here. Any help is appreciated.