It appears that all of the libraries in the CUDA 12.2 stack of the NVHPC 23.7 release have been built against CUDA 12.1 instead of CUDA 12.2.
For example,
Linux_x86_64/23.7/comm_libs/12.2/hpcx/latest/ompi/bin$ ./ompi_info
shows:
Configure command line: 'CC=gcc' 'CXX=g++' 'FC=nvfortran'
'LDFLAGS=-Wl,-rpath-link=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/ucx/lib
-Wl,-rpath-link=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/hcoll/lib'
'--with-platform=../contrib/platform/nvhpc/optimized'
'--enable-mpi1-compatibility'
'--with-libevent=internal' '--without-xpmem'
'--with-slurm'
'--with-cuda=/proj/cuda/12.1/Linux_x86_64'
'--with-hcoll=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/hcoll'
'--with-ucc=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/ucc'
'--with-ucx=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/ucx'
'--prefix=/proj/nv/libraries/Linux_x86_64/23.7/hpcx-12/230483-rel-2/comm_libs/12.1/hpcx/hpcx-2.15/ompi'
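A quick way to spot the mismatch without scanning the full output (this just filters the configure line shown above for the CUDA prefix):

# Print the configure line containing the CUDA prefix; here it reports
# --with-cuda=/proj/cuda/12.1/Linux_x86_64 under the comm_libs/12.2 tree.
./ompi_info | grep -- '--with-cuda'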
Hi jlagrone,
This is expected. HPC-X 2.15, which ships with the NVHPC 23.7 release, is built against 12.1. A 12.2 path exists for it because of the way the trampolines select the flavor matching the CUDA version installed on the system, so users don't need to adjust paths.
-Mat
That's not quite true: the modules are not set up correctly either. For example, 23.7/modulefiles/nvhpc-hpcx-cuda12/23.7 sets

set hpcxmoddir $nvcommdir/12.1/hpcx/latest/modulefiles

which doesn't exist.
This is expected, although I apologize for the confusion.
23.5 introduced the use of trampoline drivers for the MPI wrappers, including mpicc, mpirun, etc. These trampolines attempt to detect the version of the CUDA driver installed on the system, and then forward the MPI wrapper invocation to the flavor of MPI most closely matching the CUDA version of the running system.
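(For illustration only, here is a minimal sketch of what such a trampoline might do. This is not NVHPC's actual implementation; the install root, directory layout, and nvidia-smi parsing are all assumptions.)

#!/bin/sh
# Hypothetical trampoline: pick the comm_libs flavor whose CUDA version
# matches the driver on the running system, then forward the invocation.
NVROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs
# Ask the driver which CUDA version it supports (e.g. "12.2").
CUDA_VER=$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9]*\).*/\1/p')
if [ -d "$NVROOT/$CUDA_VER" ]; then
    # Forward to the MPI build matching the detected CUDA version.
    exec "$NVROOT/$CUDA_VER/hpcx/latest/ompi/bin/mpicc" "$@"
else
    # The failure mode described above: a 12.2 driver on the system,
    # but no usable 12.2 directory to forward to.
    echo "error: no MPI build for CUDA $CUDA_VER under $NVROOT" >&2
    exit 1
fi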
The issue here is that the default CUDA 12 version in the 23.7 release was updated to 12.2. However, the version of CUDA that HPC-X 2.15 is built against is still 12.1, as you noted. The Open MPI 4.1.5 build in that release is also built against 12.1. Unfortunately, due to the way the trampolines are implemented, they will fail on a 12.2 system without a 12.2 directory.
We have updated the Open MPI 4.1.5 build for the next release (23.9) to build against 12.2, but the HPC-X team is still building HPC-X against 12.1. We do regret any inconvenience caused by this confusion.
This is a known bug and will be fixed in 23.9.
For now, feel free to edit the modulefile to point to the 12.2 directory instead.
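(A minimal example of that edit, assuming the layout shown above; $NVHPC_ROOT is a placeholder for your install prefix, e.g. /opt/nvidia/hpc_sdk/Linux_x86_64:)

# Point hpcxmoddir at the 12.2 tree instead of the nonexistent 12.1 one.
sed -i 's|/12\.1/hpcx|/12.2/hpcx|' \
    "$NVHPC_ROOT/23.7/modulefiles/nvhpc-hpcx-cuda12/23.7"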
How is that helpful in an HPC environment?
Our expectation is that code is compiled on nodes that do not have CUDA libraries (no GPUs) but will run on compute nodes with CUDA and GPUs. We'd prefer to set up modules and other tooling only with configurations supported by the system, so users know what they are using and get the appropriate optimizations. Reporting exact library versions is often desired in publications, and NVHPC changing what it's using under the hood is not desirable behavior for us; we'd prefer meaningful error messages if incompatible versions are mixed.