Hello Nvidia family,
This is my very first post as an NVIDIA developer on this forum. Please forgive me if this post does not follow the forum standards; any help to improve it would be appreciated.
This question is specific to CUDA-aware MPI and RDMA transfers.
Ref: GPUDirect | NVIDIA Developer
Short story of this topic: We have installed the GPUDirect RDMA kernel driver, nvidia_peermem, on our cluster, which allows direct GPU-to-GPU communication between nodes. In spite of the driver being available, when the data exchange takes place between 2 nodes I can still see (from the Nsight report) the data being offloaded from GPU to CPU. I expected cudaMemcpy P2P transfers, not cudaMemcpy D2H transfers, in the Nsight trace file. The Device-to-Host and Host-to-Device transfers occur when the GPU stages the data to be sent through pinned CUDA buffers in CPU memory. But with the GPUDirect RDMA driver installed, these staging transfers should be eliminated.
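As a basic sanity check that the pieces are at least present (standard module and UCX tool names, nothing cluster-specific; assuming ucx_info is on the PATH), one can run:

lsmod | grep nvidia_peermem            # is the kernel module actually loaded?
ucx_info -d | grep -i -e cuda -e gdr   # does UCX expose the cuda_copy/gdr_copy transports?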
Note: Please see the snapshots from the Nsight report below to understand the problem correctly. =)
Long story:
We have 2 clusters:
- 'Cluster 1' with 8xA100s (SXM and NVLink) and no GPUDirect RDMA driver:
  - Open MPI 4.1.6, compiled with CUDA-aware support; uses the UCX fabric (compiled with verbs)
  - NVHPC SDK toolkit, V12.2.91
  - No nvidia_peermem kernel driver
  - InfiniBand: 2xHDR100 ConnectX-6
- 'Cluster 2' with 4xH100s (SXM and NVLink) and the GPUDirect RDMA driver installed:
  - Open MPI 5.0.5, compiled with CUDA-aware support; uses the UCX fabric (compiled with verbs)
  - NVHPC SDK toolkit, V12.6.77
  - nvidia_peermem kernel driver loaded
  - InfiniBand: 2xNDR200 ConnectX-7
We communicate between 2 nodes using a CUDA-aware MPI PingPing code that lets us measure the effective communication bandwidth between 2 nodes in a cluster: each rank asynchronously sends a buffer from its GPU to the other node's GPU, so traffic flows in both directions at once. A minimal sketch of the pattern is shown below.
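This sketch is illustrative, not our exact benchmark (the message size, iteration count, and bandwidth printout are placeholders); it only shows the structure of the exchange, with MPI operating directly on device pointers:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 26;            /* 64 MiB message, example size */
    char *d_send, *d_recv;
    cudaMalloc((void **)&d_send, nbytes);  /* device buffers: a CUDA-aware MPI */
    cudaMalloc((void **)&d_recv, nbytes);  /* takes these pointers directly    */

    int peer = 1 - rank;                   /* exactly 2 ranks, 1 per node */
    const int iters = 100;
    MPI_Request reqs[2];

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        /* PingPing: both ranks send and receive at the same time,
           unlike PingPong where only one direction is active per step.
           Error checking omitted for brevity. */
        MPI_Irecv(d_recv, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(d_send, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("effective bandwidth per direction: %.2f GB/s\n",
               (double)nbytes * iters / dt / 1e9);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}

Below is the behavior we observe when executing the actual CUDA-aware MPI PingPing code with and without the GPUDirect RDMA driver.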
I use this command to execute my code's MPI executable with 2 ranks (1 rank and 1 GPU per node):
mpirun --report-bindings --map-by ppr:1:node -n 2 nsys profile --trace=mpi,cuda --mpi-impl=openmpi ./cuda_aware
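(In case it is relevant to my first question below: nsys expands %q{ENV_VAR} in the output file name, so a variant that writes one clearly named report per rank would be the following; the output name itself is just an example.)

mpirun --report-bindings --map-by ppr:1:node -n 2 nsys profile --trace=mpi,cuda --mpi-impl=openmpi -o cuda_aware_rank%q{OMPI_COMM_WORLD_RANK} ./cuda_aware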
Cluster 1. When we perform the PingPing operation between 2 nodes using CUDA-aware MPI, we expect (because Cluster 1 has no GPUDirect RDMA driver) the GPU to stage the buffers to be sent through pinned CUDA buffers in CPU memory. And we see exactly that behavior in the Nsight report below:
We see Device-to-Host memcpys for sending data to the other node and Host-to-Device memcpys for receiving data from it. Fine, no problem.
Cluster 2. Now, when we perform the PingPing operation between 2 nodes, we expect to see peer-to-peer transfers in the Nsight report. Instead, I see confusing things:
- No cudaMemcpy Peer-to-Peer operations.
- Only cudaMemcpy Device-to-Host operations.
- No cudaMemcpy Host-to-Device operations.
Fig: Nsight trace from Cluster 2 (snapshot attached).
This leaves me with a lot of questions:
- Is this the correct command to run an MPI program so that each rank's trace file correctly captures RDMA transfers?
- What would be the correct way to verify that RDMA transfers are actually happening?
- Is the nsys tracer unable to capture the traces correctly? Is it reporting the wrong operations for RDMA transfers, i.e. should it show cudaMemcpy P2P operations for RDMA transfers but somehow shows cudaMemcpy D2H operations instead?
- If the nsys tracer is not wrong, why don't I see the remaining half, the cudaMemcpy H2D operations, if the GPU is still staging through CPU memory and no RDMA transfers are taking place?