Does GPUDirect RDMA stage data through the CPU?

Dear staff,

I am profiling an MPI-GPU application (OpenACC + MPI + CUDA Fortran), built with the HPC-X MPI bundled in nvhpc/24.3, to analyze its communications. These are implemented with CUDA-aware MPI calls, in particular MPI_Isend + MPI_Irecv + MPI_Waitall, and every rank communicates with every other rank. In this run I have 8 MPI ranks (one GPU each) distributed over 2 nodes. Looking at the report in nsys-ui, I noticed that MPI_Waitall performs a D2H copy (the lilac events in the picture below), which I did not expect if the network supports GPUDirect RDMA. I do see peer-to-peer communications between ranks inside the same node, but I do not understand why the data is staged through the CPU from within MPI_Waitall. I guess these D2H copies are needed to move the data to MPI ranks outside the node? Is this behaviour expected even when GPUDirect RDMA is available? I also tried setting export UCX_IB_GPU_DIRECT_RDMA=y, but I did not notice any difference.
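
For reference, here is a minimal, self-contained sketch of the communication pattern I am describing (not the actual application code; buffer names and sizes are illustrative). The device-resident buffers are passed directly to the CUDA-aware MPI calls, and the D2H copies in question show up under the MPI_Waitall:

```fortran
program exchange_sketch
  use mpi
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real(8), device, allocatable :: sendbuf(:,:), recvbuf(:,:)
  integer, allocatable :: reqs(:)
  integer :: ierr, nranks, myrank, peer, nreq

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! Device-resident send/receive buffers, one column per peer
  allocate(sendbuf(n, 0:nranks-1), recvbuf(n, 0:nranks-1))
  allocate(reqs(2*(nranks-1)))
  sendbuf = real(myrank, 8)

  ! All-to-all style nonblocking exchange on device buffers
  nreq = 0
  do peer = 0, nranks-1
     if (peer == myrank) cycle
     nreq = nreq + 1
     call MPI_Irecv(recvbuf(:, peer), n, MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, reqs(nreq), ierr)
     nreq = nreq + 1
     call MPI_Isend(sendbuf(:, peer), n, MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, reqs(nreq), ierr)
  end do

  ! With GPUDirect RDMA I would expect this to complete without staging
  ! through host memory; the observed D2H copies happen inside this call
  call MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program exchange_sketch
```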

Thank you for your help,

Laura

The behaviour is related to the hardware topology. With 8 ranks/GPUs on 2 nodes, each GPU may not have its own IB HCA to communicate through. Ideally, a 1:1 GPU:HCA mapping fully leverages GDR (GPUDirect RDMA); if there are not enough HCAs, the GPUs have to share them, and the data is then copied D2H and sent with regular RDMA between the nodes.

It may also depend on the MPI operations your job performs: some of them require CPU-side processing, which likewise implies a D2H copy.
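
As a starting point, it usually helps to make sure each rank is pinned to the GPU that corresponds to its node-local rank; the HCA choice itself is normally handled on the UCX side (for example via UCX_NET_DEVICES), but a clean per-rank GPU binding is a prerequisite for approaching a 1:1 GPU:HCA mapping. Below is a hypothetical sketch of such a binding (the subroutine name and round-robin policy are illustrative, not something your application necessarily needs):

```fortran
subroutine bind_local_gpu()
  use mpi
  use cudafor
  implicit none
  integer :: ierr, local_comm, local_rank, ndev

  ! Split the world communicator into per-node (shared-memory) communicators
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Assign one GPU per node-local rank (round-robin if oversubscribed)
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(local_rank, ndev))

  call MPI_Comm_free(local_comm, ierr)
end subroutine bind_local_gpu
```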
