Long overhead with cuStreamSynchronize with OMPI

Hi guys,
I took a trace of a piece of a CUDA-aware MPI application, and it seems that cuStreamSynchronize has a 10x overhead. What is odd is that the memcpy finished long before the synchronize returned (please see the attached snapshot). What can the synchronization (driver) be doing that keeps it from returning after the transfer is complete?
Thanks, Noob-Noob

Poke. Any response?

I am sorry that there was no response to this earlier; your forum post was dropped into an orphaned category that the Nsys team was unaware of until this afternoon.

Can you give us the .qdrep in question so that we can take a closer look at it?

Hi, sorry, I was only provided the screenshot. Is there any high-level rationale for what could be causing this overhead, or anything that can be done to optimize it?

You might take a look at Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems | NVIDIA Developer Blog.

Thank you for the document. Question: if I have a multi-process (MPI-based) program and all the processes have a stream 26, will submissions from the different processes be serialized within the driver (and/or hardware)?

Could you please give us more information on the application run and the CUDA toolkit, CUDA driver, and OpenMPI versions used?

How many MPI processes are running? Is it possible that they use the same GPU? Either way, it is strange that cuStreamSynchronize takes so much time after the memcpy, but it is tough to guess what the driver is waiting for from just the screenshot. The context's scheduling strategy could also be a reason for the long cuStreamSynchronize (see the cuCtxCreate() docs).
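For context, the scheduling flag controls how the host thread waits inside cuStreamSynchronize (spin, yield, or block on an OS primitive). Below is a minimal runtime-API sketch for selecting blocking synchronization; it assumes your application creates the primary context itself, which may not be how GROMACS/Open MPI set things up (the driver-API equivalent is the CU_CTX_SCHED_* flag passed to cuCtxCreate):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Illustrative sketch only. The flag must be set before the primary context
    // is created, i.e. before the first CUDA call that touches the device.
    // cudaDeviceScheduleBlockingSync: the host thread sleeps on a sync primitive
    // while waiting, instead of spin-waiting (cudaDeviceScheduleSpin) or
    // yielding (cudaDeviceScheduleYield).
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
    }

    // ... rest of the application: MPI init, allocations, async memcpys,
    //     stream synchronization, etc. ...
    return 0;
}

With spin-waiting, a synchronize can also burn a CPU core while it waits, which matters if several ranks share cores.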

CUDA 11.2, driver 460.32.03, Open MPI 5.1.0a1, 8 MPI processes.

I have a feeling that OMPI is just using one stream (handle) across multiple GPUs and the driver might be getting confused, but I’m not sure. It’s strange to me that all the processes have a stream 26. I’m guessing that might be OK since the default stream also has the same handle (7). I have confirmed that cuStreamSynchronize returns a little after the last transfer completes on that stream across all the ranks (not sure if that helps).

Each MPI process will create its own CUDA context, each with its own CUDA streams. If the MPI processes are all doing the same CUDA operations, they will very likely also have the same CUDA stream IDs. If the MPI processes use different GPUs, there is no serialization between operations on different GPUs. If you use MPS, there is a good chance that CUDA kernels or memory copies are performed concurrently even on the same GPU. Without MPS, CUDA kernels/memcopies that are submitted from different MPI processes to the same GPU are serialized.
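As a minimal sketch of what I mean (illustrative only, not the Open MPI or GROMACS code): every rank runs the same sequence of CUDA calls, so the stream handles it gets will usually line up across ranks, even though each process owns its own context and its own streams.

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // Each process gets its own primary context on the device it selects here;
    // ranks may share a GPU if there are more ranks than devices.
    int dev = (ndev > 0) ? rank % ndev : 0;
    cudaSetDevice(dev);

    // Each process creates its own stream inside its own context. Because every
    // rank executes the same call sequence, the profiler's stream IDs (e.g.
    // "Stream 26") will typically coincide across ranks, but the streams are
    // still distinct objects in distinct contexts.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    printf("rank %d: device %d, stream handle %p\n", rank, dev, (void*)stream);

    cudaStreamDestroy(stream);
    MPI_Finalize();
    return 0;
}

So “stream 26” in rank 0’s timeline and “stream 26” in rank 1’s timeline are different streams; whether work on them serializes depends on whether the ranks share a GPU and whether MPS is in use, as described above.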

That would be my intuition as well. The transfers happen in parallel, but the synchronization doesn’t return until all of those transfers across all the processes are complete. Also, this doesn’t seem to happen over NVLink (using P2P), only for PCIe host HtoD/DtoH copies.

That OpenMPI version is more than up to date. Is MPS enabled on your system? Can you check that all MPI processes use different CPU cores? (In your screenshot the CPU core for MPI rank 0 is colored in red, i.e. the bar/row between the black one and the green one.)
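If it is easier than reading it off the timeline, a quick way to check the binding from inside the application is to print the CPU core per rank. Here is a minimal sketch (Linux-only, using sched_getcpu(); purely illustrative, not meant as the GROMACS code):

#include <mpi.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Print the core each rank is currently running on. If several ranks report
    // the same core, they compete for it, and a spin-waiting cuStreamSynchronize
    // can appear much longer than the underlying transfer.
    printf("rank %d is running on CPU core %d\n", rank, sched_getcpu());

    MPI_Finalize();
    return 0;
}

Binding is usually also controllable from the launcher, e.g. with Open MPI’s --bind-to core option.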

To get more information, you could enable sampling with DWARF backtraces (nsys profile -s cpu -b dwarf). If you want, you can also share the report file (.qdrep) or your code so that we can try to reproduce the issue.

Another helpful data point for us would be to know if your cuStreamSynchronize calls are faster when Nsight Systems is not being used, or if they take this long even without tools. Would you mind adding some quick timing code around your MPI_Send calls? Those seem like they are being made much slower by long cuStreamSynchronize calls, so if you see MPI_Send get much faster when running without Nsight Systems, we’ll know it’s an issue with the tool.

An easy way to get timestamps with good-enough precision is to use C++11’s chrono library. You can add some helper code like this at the top of your .cpp file:

#include <chrono>
#include <cstdio>   // for the printf used below

using std::chrono::duration_cast;
using msec = std::chrono::milliseconds;
using usec = std::chrono::microseconds;
using nsec = std::chrono::nanoseconds;

auto& now = std::chrono::high_resolution_clock::now;

…and then to time an MPI_Send call, get timestamps before and after it, and look at the difference. You can convert the difference to whatever units are convenient, like this:

auto before = now();
int result = MPI_Send(...);
auto after = now();

int elapsedUsec = (int)duration_cast<usec>(after - before).count();

printf("MPI_Send duration: %d usec\n", elapsedUsec);

Then you can run your app with and without Nsight Systems, and see if the durations of MPI_Send calls change significantly. Knowing that would really help us!

Hi guys, the performance difference with and without nsys is within 4%. The application is the master branch of GROMACS, if that helps. The initial screenshot is a trace from that application.

Poke. Any update on this? I’m still trying to figure out what is happening here. Any help is appreciated.