Tracing data copies with CUDA-aware MPI


I am tracing MPI communications and data movement in an MPI+OpenACC+CUDA application running on multiple nodes of the Leonardo cluster at CINECA. The code performs a number of MPI_Isend/MPI_Irecv + MPI_Waitall operations among 8 ranks spread over 2 nodes (4 ranks per node), each bound to a GPU.

When I look at the profile, I see the peer-to-peer memcpys caused by the Isend and Irecv operations, but only 3 of the 7 I expect, and according to the source and destination IDs these traces refer only to the GPUs on the same node. By tracing MPI communications I can see the correct number of Isend and Irecv ranges, and of course a longer range for the MPI_Waitall phase. By tracing UCX I can see that some longer “UCP transfer processing” ranges indicate long transfers occurring during the MPI_Waitall. Can you please explain why the memcpys between devices on different nodes are not visible in the CUDA HW row, and how I could possibly trace them?

I attach a picture from the profiling session.

Thank you for your help,


@rdietrich can you please help Laura?

Hello Laura!

It may be that we cannot record the GPU-to-GPU communication for GPUDirect inter-node communication. I’ll check that.

Can you check the CUDA API row and see if the calls match your expectations, or if the CUDA API calls are also missing?
How does your mpirun/srun command look?


I see only 3 cudaMemcpyDtoDAsync calls in the CUDA API row within MPI_Waitall, and 3 corresponding Memcpy PtoP ranges in the CUDA HW panel. These calls are not explicit in the source code (I suppose they are issued by the GPU-aware implementation of Isend/Irecv); they do not match my expectations.

In contrast, I can see all 7 of the cudaMemcpy2DAsync calls that are explicitly made in the source code (and the corresponding ranges in the CUDA HW panel). For these, “Source memory kind” and “Destination memory kind” are “Device” with no ID; they match my expectations.

The program is launched with mpirun.

Thank you for your help,


After a bit of digging, I have to admit that Nsight Systems cannot record GPU-to-remote-GPU (GPUDirect RDMA) memory copies.

You can still get some information on the GPU data transfers via GPU metrics sampling (see “GPU Metrics” and “Avoid redundant GPU and NIC metrics collection” in the Nsight Systems user guide). This should give you the PCIe throughput for send and receive. Together with the MPI ranges, the “UCP transfer processing” ranges (UCX), and optionally the NIC metrics, you should be able to see when and where data is transferred.
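As a sketch of what such a launch could look like with Open MPI (flag spellings can differ between Nsight Systems versions, and `my_app` is a placeholder, so please check `nsys profile --help` on your system):

```shell
# Collect CUDA, MPI, and UCX trace data plus GPU metrics sampling
# (which includes PCIe RX/TX throughput) and NIC metrics.
# One report file per rank via the %q{ENV_VAR} output template.
mpirun -np 8 \
    nsys profile \
      --trace=cuda,mpi,ucx \
      --gpu-metrics-device=all \
      --nic-metrics=true \
      -o report_rank%q{OMPI_COMM_WORLD_RANK} \
      ./my_app
```

Note that with 4 ranks per node you may want to enable GPU/NIC metrics on only one rank per node (e.g. via a wrapper script that checks the local rank), which is exactly what the “Avoid redundant GPU and NIC metrics collection” section describes.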

This is not as nice as a simple PtoP range, but when data travels on the direct path between remote GPUs, normal GPU memcpy tracing is bypassed as well. If you still want to follow the data, you can set UCX_IB_GPU_DIRECT_RDMA=no, which should bring the Memcpy ranges back into the timeline (at the cost of bandwidth and latency, of course).
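If you try the UCX_IB_GPU_DIRECT_RDMA=no route, the variable has to reach all ranks. With Open MPI’s mpirun that could look like the following (a sketch: `-x` is Open MPI-specific, and `my_app` is a placeholder):

```shell
# Disable GPUDirect RDMA in UCX so inter-node transfers are staged
# through host memory; those staging copies then reappear as
# Memcpy ranges in the Nsight Systems timeline.
mpirun -np 8 -x UCX_IB_GPU_DIRECT_RDMA=no \
    nsys profile --trace=cuda,mpi,ucx ./my_app
```

With srun you would instead export the variable in the job environment (e.g. `export UCX_IB_GPU_DIRECT_RDMA=no` in the batch script).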

The Multi-GPU Programming with MPI presentation by @jkraus provides a pretty nice overview of what’s possible.