MPI_Allreduce + GPU RDMA with MPI_IN_PLACE hangs with newer HPC SDK versions

I use the following test to check whether MPI collective operations work with GPU RDMA and OpenMP target offload:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
   MPI_Init(&argc, &argv);
   int mpi_size, rank;
   MPI_Comm_size (MPI_COMM_WORLD, &mpi_size);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   omp_set_default_device(rank);   /* one GPU per rank */

   size_t N = 134217728;           /* 128 Mi ints = 512 MiB per buffer */
   int *inplace_buf = (int*)malloc(N * sizeof(int));
   int *sbuf = (int*)malloc(N * sizeof(int));
   int *rbuf = (int*)malloc(N * sizeof(int));

   for (size_t i = 0; i < N; i++) {
      inplace_buf[i] = rank + (int)i;
      sbuf[i] = rank + (int)i;
      rbuf[i] = 0;
   }

   int mpi_err;

   /* out-of-place reduction on device buffers */
#pragma omp target enter data map(to:sbuf[0:N],rbuf[0:N])
#pragma omp target data use_device_ptr(rbuf,sbuf)
{
   mpi_err = MPI_Allreduce (sbuf, rbuf, (int)N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
#pragma omp target exit data map(from:rbuf[0:N])
#pragma omp target exit data map(delete:sbuf[0:N])

   if (mpi_err != MPI_SUCCESS) {
      printf ("OUT_OF_PLACE: MPI ERROR\n");
      return -1;
   } else {
      printf ("OUT_OF_PLACE: SUCCESS\n");
   }

   /* in-place reduction on a device buffer */
#pragma omp target enter data map(to:inplace_buf[0:N])
#pragma omp target data use_device_ptr(inplace_buf)
{
   mpi_err = MPI_Allreduce (MPI_IN_PLACE, inplace_buf, (int)N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
#pragma omp target exit data map(from:inplace_buf[0:N])

   if (mpi_err != MPI_SUCCESS) {
      printf ("IN_PLACE: MPI ERROR\n");
      return -1;
   } else {
      printf ("IN_PLACE: SUCCESS\n");
   }

   free(inplace_buf);
   free(sbuf);
   free(rbuf);
   MPI_Finalize();
   return 0;
}

This test works fine with HPC SDK 23.11 and the nvhpc-hpcx-cuda12 communication libraries it ships with. With newer versions of the SDK, in particular the latest one, 24.11, I do not get the second output for MPI_IN_PLACE; instead, execution hangs in MPI_Allreduce. To be precise, I compile like this

mpicc -mp=gpu test.c -o test.x

and run it with two or more processes to reproduce the erroneous behavior:

mpirun -np 2 ./test.x

With only one rank, the test behaves fine.
Using strace, I can see that program execution hangs in an ioctl call, probably while talking to the device.
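For reference, I collect one trace per rank with something along these lines (the OMPI_COMM_WORLD_RANK variable is set by the Open MPI/HPC-X launcher and is only used here to name the output files):

mpirun -np 2 bash -c 'strace -f -o trace.rank$OMPI_COMM_WORLD_RANK ./test.x'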
I suspected that the issue might originate in a mismatch between the installed CUDA version (12.4) and the one HPC SDK 24.11 is built for (12.6). I therefore took the latest HPC SDK version built for CUDA 12.4 (24.5) and explicitly compiled with the -gpu=cuda12.4 flag. That does not resolve the issue, and neither does compiling with 23.11 and running with 24.11. Maybe the problem lies in the communication libraries?
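For completeness, with 24.5 the compile line looked roughly like this (again assuming -mp=gpu for the OpenMP offload):

mpicc -mp=gpu -gpu=cuda12.4 test.c -o test.x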

Do you have an idea what I might be missing out on or what causes the problem?

I don’t know for sure, but my best guess is that there is some type of file locking with the collective. I’ll need to ask Chris to take a look and possibly talk to the HPC-X folks.

I can work around the error by using the shared-memory transport instead of the CUDA transport:

 % mpirun -np 2 -x UCX_TLS=sm -mca coll_hcoll_enable 0 ./test.x
OUT_OF_PLACE: SUCCESS
OUT_OF_PLACE: SUCCESS
IN_PLACE: SUCCESS
IN_PLACE: SUCCESS

Also, using the OpenMPI 4 install (“24.11/comm_libs/openmpi4”) instead of HPC-X works around it as well.
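In case it helps, switching to that install is just a matter of putting its compiler wrapper and launcher first in the path, roughly like below. The install prefix and the exact layout under comm_libs are assumptions here, so adjust them to your system:

export NVHPC=/opt/nvidia/hpc_sdk/Linux_x86_64/24.11
export PATH=$NVHPC/comm_libs/openmpi4/bin:$PATH
mpicc -mp=gpu test.c -o test.x
mpirun -np 2 ./test.x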

-Mat

Hi Mat,

Thanks for forwarding this issue. Do I understand correctly that the workaround configurations do not use the GPU and therefore do not allow for GPU RDMA?
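As a side note, I assume I can at least check whether a given MPI build advertises CUDA awareness with a small program like the one below. This is only a sketch: mpi-ext.h and MPIX_Query_cuda_support() are Open MPI-specific extensions, and they only report whether the library was built with CUDA support, not which transport is actually selected at run time.

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extension header; defines MPIX_CUDA_AWARE_SUPPORT if available */

int main (int argc, char *argv[]) {
   MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT)
   /* 1 = the library reports CUDA-aware support at run time, 0 = it does not */
   printf("MPIX_Query_cuda_support() = %d\n", MPIX_Query_cuda_support());
#else
   printf("This build's mpi-ext.h does not define MPIX_CUDA_AWARE_SUPPORT\n");
#endif
   MPI_Finalize();
   return 0;
}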

Chris let me know that he’s working on it and has filed a problem report, TPR #36819, to track it. He’ll likely need to send it over to the HPC-X team.