Hi,
I have been testing the new nvhpc 24.1 on our multi-GPU MPI+OpenACC test code, available here:
I am using the main code (the version that uses manual data management).
I am compiling with the compile line from the README (with the cc value changed to match the GPU on each system):
mpif90 psi_multigpu_test_code.f -acc=gpu -gpu=cc80,nomanaged,nounified -Minfo=accel -o psi_multigpu_test_code
I am loading nvhpc 24.1 using the following (from the documentation):
version=24.1
NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/$version/compilers/bin:$PATH; export PATH
export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/man
When I do this, the test code works fine on an A100 on the Delta system at NCSA, and on a local workstation with an RTX 2080 Ti. It even works out-of-the-box across multiple nodes of Delta (nice!).
However, when I try it on three additional systems (with an RTX 3090 Ti, an RTX 3060 Ti, and an RTX 4070, respectively), the code seg faults when it hits the MPI_Waitall call that follows the MPI_Isend/MPI_Irecv calls in which a derived-type array member is passed directly to MPI through host_data:
Starting seam cycles...
[rolly-linux:129326:0:129326] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f1b595c8580)
15 0x000000000004e299 ompi_waitall_f() /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
16 0x00000000004043ca seam_vvec_()
<PATH>/psi_multigpu_test_code/psi_multigpu_test_code.f:700
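For context, the failing pattern is essentially the following. This is a minimal free-form sketch with made-up names (it is not the actual test code): a device-resident allocatable member of a derived type is handed straight to MPI_Isend/MPI_Irecv inside a host_data region, and the crash is reported at the MPI_Waitall that follows. (Being free-form, the sketch would need a .f90 suffix or -Mfree, but otherwise the same flags apply.)

! Minimal sketch (hypothetical names) of the pattern that crashes:
! pass a device-resident derived-type member directly to CUDA-aware MPI
! inside host_data, then wait on the requests.
program host_data_sketch
  use mpi
  implicit none
  type :: vvec_t
     real(8), allocatable :: r(:)
  end type vvec_t
  type(vvec_t) :: v
  real(8), allocatable :: rbuf(:)
  integer :: ierr, irank, nproc, left, right, n
  integer :: reqs(2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  right = mod(irank + 1, nproc)
  left  = mod(irank - 1 + nproc, nproc)

  n = 1024
  allocate (v%r(n), rbuf(n))
  v%r  = real(irank, 8)
  rbuf = 0.0d0

  ! Manual data management: create the struct and its member on the device.
  !$acc enter data copyin(v, v%r) create(rbuf)

  ! Hand the *device* addresses of the derived-type member and the
  ! receive buffer directly to MPI (CUDA-aware path).
  !$acc host_data use_device(v%r, rbuf)
  call MPI_Isend(v%r,  n, MPI_REAL8, right, 0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Irecv(rbuf, n, MPI_REAL8, left,  0, MPI_COMM_WORLD, reqs(2), ierr)
  !$acc end host_data

  ! The seg fault on the failing systems is reported inside this call.
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)

  !$acc exit data delete(v%r, v, rbuf)
  deallocate (v%r, rbuf)
  call MPI_Finalize(ierr)
end program host_data_sketch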
The CUDA driver version, CUDA runtime version, etc. are identical across all of the workstations.
The only difference I can see is the Linux kernel version: the two systems that work are running kernels 4.18 and 5.15, while the three systems that fail are all running kernel 6.5.
I have also tried sourcing the HPC-X init script and running hpcx_load, but the code still seg faults.
However, if I revert to the OpenMPI 3.1.5 build that is (thankfully) still included in nvhpc by using:
export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/man
the code works!
Could there be some kind of incompatibility between the HPC-X MPI and Linux kernel 6.x?
Thanks!
– Ron