Dear staff,
I am profiling an MPI+GPU application (OpenACC + MPI + CUDA Fortran), built with the HPC-X MPI shipped in nvhpc/24.3, to analyze its communications. The communications are implemented with CUDA-aware MPI calls, in particular MPI_Isend + MPI_Irecv + MPI_Waitall, and every rank communicates with all the other ranks. In this run I use 8 MPI ranks, one GPU per rank, distributed over 2 nodes.

When I inspect the report in nsys-ui, I see that MPI_Waitall performs device-to-host (D2H) copies (the lilac events in the picture below), which I did not expect if the network supports GPUDirect RDMA. I do see peer-to-peer transfers between ranks on the same node, but I do not understand why the data is staged through the CPU inside MPI_Waitall. My guess is that these D2H copies are needed to move the data to MPI ranks on the other node. Is this behaviour expected even when GPUDirect RDMA is available? I also tried setting export UCX_IB_GPU_DIRECT_RDMA=y, but I did not notice any difference.
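For context, here is a minimal sketch of the communication pattern (simplified, with placeholder names such as exchange, sendbuf and recvbuf; it is not the production code). In this sketch the device buffers are exposed to the CUDA-aware MPI calls through an OpenACC host_data region:

    ! Sketch of the all-to-all exchange pattern described above.
    ! Buffers are assumed to be already present on the device
    ! (e.g. inside an enclosing !$acc data region).
    subroutine exchange(sendbuf, recvbuf, n, nranks, comm)
      use mpi
      implicit none
      integer, intent(in) :: n, nranks, comm
      real(8) :: sendbuf(n, 0:nranks-1), recvbuf(n, 0:nranks-1)
      integer :: myrank, peer, ierr, nreq
      integer :: requests(2*(nranks-1))

      call MPI_Comm_rank(comm, myrank, ierr)
      nreq = 0

      ! Pass device addresses of the buffers to the CUDA-aware MPI calls
      !$acc host_data use_device(sendbuf, recvbuf)
      do peer = 0, nranks-1
         if (peer == myrank) cycle
         nreq = nreq + 1
         call MPI_Irecv(recvbuf(:, peer), n, MPI_DOUBLE_PRECISION, peer, 0, &
                        comm, requests(nreq), ierr)
         nreq = nreq + 1
         call MPI_Isend(sendbuf(:, peer), n, MPI_DOUBLE_PRECISION, peer, 0, &
                        comm, requests(nreq), ierr)
      end do
      !$acc end host_data

      ! The D2H copies show up inside this call in the nsys timeline
      call MPI_Waitall(nreq, requests, MPI_STATUSES_IGNORE, ierr)
    end subroutine exchange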
Thank you for your help,
Laura