I have this test to check whether MPI collective operations work with GPU-resident buffers (GPUDirect RDMA) and OpenMP target offload:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int mpi_size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* select one GPU per rank */
    omp_set_default_device(rank);

    size_t N = 134217728;
    int *inplace_buf = (int *)malloc(N * sizeof(int));
    int *sbuf = (int *)malloc(N * sizeof(int));
    int *rbuf = (int *)malloc(N * sizeof(int));
    for (size_t i = 0; i < N; i++) {
        inplace_buf[i] = rank + i;
        sbuf[i] = rank + i;
        rbuf[i] = 0;
    }

    int mpi_err;

    /* out-of-place reduction on device buffers */
    #pragma omp target enter data map(to: sbuf[0:N], rbuf[0:N])
    #pragma omp target data use_device_ptr(rbuf, sbuf)
    {
        mpi_err = MPI_Allreduce(sbuf, rbuf, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }
    #pragma omp target exit data map(from: rbuf[0:N])
    #pragma omp target exit data map(delete: sbuf[0:N])

    if (mpi_err != MPI_SUCCESS) {
        printf("OUT_OF_PLACE: MPI ERROR\n");
        return -1;
    } else {
        printf("OUT_OF_PLACE: SUCCESS\n");
    }

    /* in-place reduction on a device buffer */
    #pragma omp target enter data map(to: inplace_buf[0:N])
    #pragma omp target data use_device_ptr(inplace_buf)
    {
        mpi_err = MPI_Allreduce(MPI_IN_PLACE, inplace_buf, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }
    #pragma omp target exit data map(from: inplace_buf[0:N])

    if (mpi_err != MPI_SUCCESS) {
        printf("IN_PLACE: MPI ERROR\n");
        return -1;
    } else {
        printf("IN_PLACE: SUCCESS\n");
    }

    free(inplace_buf);
    free(sbuf);
    free(rbuf);

    MPI_Finalize();
    return 0;
}
This test works fine with HPC SDK 23.11 and the nvhpc-hpcx-cuda12 module shipped with it. With newer versions of the SDK, in particular the newest one (24.11), I do not get the second output line (the MPI_IN_PLACE case); instead, execution hangs in MPI_Allreduce. To be precise, I compile like this
mpicc -mp=gpu test.c -o test.x
and run it with two or more processes to reproduce the erroneous behavior:
mpirun -np 2 ./test.x
With only one rank, the test behaves fine.
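For clarity, with 24.11 and two ranks the only output I get before the hang is the first status line, once per rank:
OUT_OF_PLACE: SUCCESS
OUT_OF_PLACE: SUCCESS
With 23.11 the two IN_PLACE: SUCCESS lines follow as expected.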
Using strace, I can see that program execution hangs in an ioctl call, probably while talking to the device.
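(The exact strace invocation should not matter; I traced the ranks roughly like this, which writes one trace.<pid> file per process, and each trace ends at the ioctl mentioned above:)
mpirun -np 2 strace -f -ff -o trace ./test.x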
I suspected that the issue might originate in the mismatch between the installed CUDA version (12.4) and the one HPC SDK 24.11 is built for (12.6). I therefore took the latest HPC SDK release that is built for CUDA 12.4 (24.5) and explicitly compiled with the -gpu=cuda12.4 flag. This does not resolve the issue, and neither does compiling with 23.11 and running with 24.11. Maybe the problem lies in the communication libraries (the HPC-X stack bundled with the SDK)?
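If the communication libraries are the suspect, I assume the first sanity checks would be along these lines (tool and parameter names are the ones I know from the stock Open MPI, UCX, and HPC-X documentation, so the bundled builds may differ):
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
ucx_info -v
mpirun -np 2 --mca coll_hcoll_enable 0 ./test.x
The last line would show whether the hang is tied to one particular collective component, but I am not sure which component the bundled HPC-X actually selects here.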
Do you have an idea what I might be missing or what is causing the problem?