MPI Fortran on Arch Linux

I’m a maintainer of the nvhpc AUR package on Arch Linux. Users there are reporting a problem with NVHPC 22.11 when running MPI programs written in Fortran. In particular, the following code compiles, but crashes at runtime as shown here: AUR (en) - nvhpc and AUR (en) - nvhpc

program hello
 use mpi_f08
 implicit none
 !include "mpif.h"

 integer :: rank, nprocs, ierr 
 call MPI_INIT( ierr )
 call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
 call MPI_COMM_SIZE( MPI_COMM_WORLD, nprocs, ierr )

 print '(a,i2,a,i2,a)', "hello from ", rank, " among ", nprocs, " procs"

 call MPI_FINALIZE( ierr )

end program hello

To reproduce the issue, install nvhpc from the official Linux x86_64 tarball, put the compilers in your PATH, compile the program above with mpif90 -o buggy buggy.f90, and run it with mpirun -n 4 buggy

What can be the cause of this problem?

UPDATE: The crashes can be reproduced on previous NVHPC versions as well.

Hi j.badwaik,

What can be the cause of this problem?

I’m unable to reproduce the issue here, but looking at the traceback from the second link, I see several unexpected libraries, in particular PMIX and a local library found in “/usr/lib/openmpi/”. Hence I suspect there’s some type of bad interaction between the OpenMPI we ship and another installation on this system.

Which “mpirun” are you using? The one we ship, or is the environment picking up a different one?

What is the output from the command:

ldd /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpirun

You might also try the OpenMPI 4 we ship, “/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi4/openmpi-4.0.5”, in case it gives you different behavior. Though it’s unclear why PMIX is getting in there, so it may not matter.

I’ll talk with the folks who build the OpenMPIs we ship for ideas.

Arch Linux is not one of our supported platforms. That doesn’t mean things shouldn’t work, only that we don’t test on it. It’s a holiday week so it may take a bit, but I’ll ask our IT folks if they can install it someplace. If I can recreate the issue, it will be a lot easier to determine the problem.

-Mat

Dear Mat,

Thank you for your reply. I tried to isolate the MPI installations in a Docker container and was able to reproduce the error in this example: sutarwadi / nvhpc-archlinux-bug-2211 · GitLab

I’ve now run this with Rocky Linux as the host system, and currently have a job running with Arch as the host system.

We do not officially support Arch Linux, and we have not done much testing with the HPC SDK on that platform. (We do not have any machines in our test lab that run Arch that I am aware of.) That said, I do run Arch Linux on a couple of my home systems, so it was easy enough for me to snag this and reproduce the issue.

I don’t know the root cause of the issue at this point. However, I do note that our Open MPI 4.0.5 and HPC-X (based on Open MPI 4.1) builds do not exhibit this problem. As a workaround for now, I would suggest switching to the Open MPI 4.0.5 or HPC-X builds within the HPC SDK on Arch. HPC-X will eventually become our default MPI.

A bit of further internal testing here indicates that the issue is confined to Fortran. I rewrote the example above in C, and it worked with Open MPI 3.1.5 on Arch. It’s unclear whether this is an MPI-only issue, or whether other Fortran programs outside of MPI would be affected. My guess is that it’s probably limited to Open MPI 3.1.5, as the other MPI variants in the HPC SDK worked for me.
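For reference, a minimal C version along these lines is sketched below. This is a rough sketch rather than the exact program I tested, and the file and executable names are just placeholders; it can be compiled with mpicc -o hello_c hello.c and run with mpirun -n 4 hello_c as before.

/* Minimal C sketch of the Fortran reproducer above (not the exact program tested). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    /* Initialize MPI, then query this process's rank and the total rank count. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    printf("hello from %2d among %2d procs\n", rank, nprocs);

    MPI_Finalize();
    return 0;
}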

Thank you for looking into the problem. I can confirm that Open MPI 4.0.x works for me. I’m not completely sure how to make HPC-X the default, so for now I’m patching the AUR package to use Open MPI 4.0.x and hoping that solves the problem: PKGBUILD · master · Jayesh Badwaik / ArchLinux / AUR / nvhpc · GitLab

Thank you again.