Tracing data copies with CUDA-AWARE MPI

l.bellentani · September 25, 2023, 7:55am

Hello,

I am tracing MPI communications and data movement in an MPI+OpenACC+CUDA application running on multiple nodes of the Leonardo cluster at CINECA. The code performs a number of ISEND/IRECV + WAITALL operations among 8 ranks spread over 2 nodes, 4 per node, each one bound to a GPU.

When I look at the profile, I see the peer-to-peer memcopies due to the isend and irecv operations, but only 3 out of the 7 expected, and according to the source and destination IDs these traces refer only to GPUs on the same node. By tracing MPI communications I can see the correct number of isend and irecv ranges and, of course, a longer trace for the WAITALL phase. By tracing UCX I can see that some longer “UCP transfer processing” ranges show long transfers occurring during the MPI WAITALL. Can you please explain why the memcopies between devices on different nodes are not visible in the CUDA HW row, and how they could be traced?

I attach a picture from the profiling session.

Thank you for your help,

Laura

hwilper · September 25, 2023, 2:21pm

@rdietrich can you please help Laura?

rdietrich · September 26, 2023, 8:05am

Hello Laura!

It may be that we cannot record GPU-to-GPU copies for GPUDirect inter-node communication. I’ll check that.

Can you check the CUDA API row and see if the calls match your expectations or if the CUDA API calls are also missing?
How does your mpirun/srun command look?

l.bellentani · September 26, 2023, 10:41am

Hello,

I see only 3 cudaMemcpyDtoDAsync calls in the CUDA API row within MPI_Waitall, and 3 corresponding Memcpy PtoP entries in the CUDA HW panel. These are not explicit in the source code (I suppose they are called by the GPU-aware implementation of isend and irecv), and they do not match my expectations.

In contrast, I can see all 7 cudaMemcpy2DAsync calls that are explicitly made in the source code (and the corresponding traces in the CUDA HW panel). For these, “Source memory kind” and “Destination memory kind” are “Device” with no ID; these match my expectations.

The program is launched with mpirun.

Thank you for your help,

Laura

rdietrich · September 26, 2023, 1:32pm

After a bit of digging, I have to admit that Nsight Systems cannot record GPU-to-remote-GPU (GPUDirect RDMA) memory copies.
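
As a quick sanity check, the 3-of-7 count reported earlier in the thread is what this limitation would predict. The sketch below assumes that every rank exchanges with all 7 others and that ranks are placed in blocks (ranks 0-3 on the first node, 4-7 on the second); both are assumptions about the application, and `classify_peers` is a hypothetical helper, not part of any tool.

```python
def classify_peers(rank, n_ranks=8, ranks_per_node=4):
    """Split the peers of `rank` into same-node and remote-node lists.

    Assumes block placement: ranks 0..3 on node 0, ranks 4..7 on node 1.
    """
    node = rank // ranks_per_node
    same_node, remote_node = [], []
    for peer in range(n_ranks):
        if peer == rank:
            continue  # a rank does not exchange with itself
        if peer // ranks_per_node == node:
            same_node.append(peer)
        else:
            remote_node.append(peer)
    return same_node, remote_node


if __name__ == "__main__":
    for rank in range(8):
        same_node, remote_node = classify_peers(rank)
        # Same-node peers can appear as Memcpy PtoP ranges;
        # remote peers go over GPUDirect RDMA and are not recorded.
        print(rank, len(same_node), len(remote_node))
```

Under these assumptions each rank has 3 same-node peers and 4 remote peers, so only 3 of the 7 transfers can appear as Memcpy PtoP ranges.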

You can still get some information on the GPU data transfer via GPU metrics sampling (see “GPU Metrics” and “Avoid redundant GPU and NIC metrics collection” in the Nsight Systems user guide). This should give you the PCIe throughput for send and receive. Together with the MPI and the UCP transfer processing ranges (UCX), and optionally also the NIC metrics, you should see when and where data is transferred.
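
A minimal sketch of how such a per-rank profiling command could be assembled, assuming Open MPI (which sets `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_COMM_WORLD_RANK`) and an Nsight Systems version that accepts `--gpu-metrics-device` and `--nic-metrics`; flag spellings vary between versions, so check `nsys profile --help`. The function name and `./my_app` are hypothetical. Metrics are enabled only on local rank 0 to avoid collecting them redundantly on every rank of a node.

```python
import os
import shlex


def nsys_args(local_rank, app="./my_app"):
    """Return an `nsys profile` command line for one MPI rank (not executed here)."""
    args = [
        "nsys", "profile",
        "--trace=cuda,mpi,ucx",
        # %q{VAR} is expanded by nsys to the value of the environment variable
        "--output=report_rank%q{OMPI_COMM_WORLD_RANK}",
    ]
    if local_rank == 0:
        # Assumption: one collecting rank per node is sufficient
        args += ["--gpu-metrics-device=all", "--nic-metrics=true"]
    return args + [app]


if __name__ == "__main__":
    # Falls back to 0 when run outside of an Open MPI launch
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    print(shlex.join(nsys_args(local_rank)))
```

A script like this would typically be used as a wrapper started by `mpirun`, so that each rank computes its own command line.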

This is not as nice as a simple PtoP range, but when data is transferred on the direct path between remote GPUs, normal GPU memcpy tracing is also bypassed. If you still want to follow the data, you can set UCX_IB_GPU_DIRECT_RDMA=no, which should bring back Memcpy ranges into the timeline (of course at the cost of bandwidth and latency).
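
For illustration, a sketch of a launch line with GPUDirect RDMA switched off, assuming Open MPI's `mpirun` (whose `-x` option exports an environment variable to all ranks). `build_launch_command` and `./my_app` are hypothetical names, and the command is only printed, not executed.

```python
import shlex


def build_launch_command(app, ranks=8, disable_gpudirect_rdma=True):
    """Assemble an `mpirun ... nsys profile ...` command line as a list of arguments."""
    cmd = ["mpirun", "-np", str(ranks)]
    if disable_gpudirect_rdma:
        # Assumption: Open MPI mpirun; -x forwards the variable to every rank
        cmd += ["-x", "UCX_IB_GPU_DIRECT_RDMA=no"]
    cmd += ["nsys", "profile", "--trace=cuda,mpi,ucx", app]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_launch_command("./my_app")))
```

Comparing a run with and without the variable should show the trade-off described above: more visible Memcpy ranges, at the cost of bandwidth and latency.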

The “Multi-GPU Programming with MPI” presentation by @jkraus provides a pretty nice overview of what’s possible.
