We have a project that uses both OpenACC and native CUDA, so we use a build environment with the NVIDIA HPC compilers (here 21.2) and a version of CUDA (11.0) chosen for compatibility with other tools. This is fragile, and whether or not it works depends on the order in which nvhpc and cuda are loaded into the environment.
The error message is:
nvlink fatal : Input file '/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64/libcudadevrt.a:cuda_device_runtime.o' newer than toolkit (112 vs 110) (target: sm_60)
which happens when:
- The nvcc binary comes from the nvhpc installation, not the cuda installation.
- CUDA_HOME is set and points to the cuda installation.
In this situation nvcc reports that it will use CUDA 11.0:
which nvcc
/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/bin/nvcc
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
CUDA_HOME=/path/to/cuda-11.0.2-kb4wci
But a trivial example fails to compile with the above error:
echo '' > dummy.cu && nvcc dummy.cu -dc -o dummy.cu.o && nvcc dummy.cu.o -o dummy
nvlink fatal : Input file '/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64/libcudadevrt.a:cuda_device_runtime.o' newer than toolkit (112 vs 110)
This appears to be because nvcc derives a library search path from its own location. If I add -dryrun to the second nvcc invocation I see
#$ nvlink --arch=sm_52 --register-link-binaries="…" -m64 -L"/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64"…
and this
/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/lib64
directory is a symbolic link to
/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.2/lib64
which is not from the 11.0 CUDA version specified by CUDA_HOME and reported by nvcc --version.
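For completeness, this is the sort of check that shows where the bundled directory actually points (the paths are copied from the -dryrun output above and are site-specific):
readlink -f /path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/lib64
# -> /path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.2/lib64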
If the environment is changed so that nvcc comes from the CUDA installation instead of the HPC SDK then it seems to work. On our system this means that module load nvhpc cuda works but module load cuda nvhpc does not.
For the sake of debugging, I note that just module load nvhpc, and module load cuda nvhpc followed by unset CUDA_HOME, also avoid the error, although presumably this is because CUDA 11.2 is being used throughout.
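A rough way to check that presumption (a sketch; the module names are specific to our system, and the grep just picks the nvlink line out of the -dryrun output shown earlier):
module purge
module load nvhpc/21.2
unset CUDA_HOME
echo '' > dummy.cu
nvcc dummy.cu -dc -o dummy.cu.o
# The -L path on the nvlink line shows which toolkit's lib64 the link will use.
nvcc dummy.cu.o -o dummy -dryrun 2>&1 | grep nvlink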
At the end of this post I have included a small test script, which might need a little adaptation. Only the cuda_nvhpc branch of the script gives the version mismatch error above.

What can we do to make this setup more robust? The behaviour of nvcc here, where --version reports one CUDA version but it links against another version, seems surprising.
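One possible mitigation, given the observation above that things work whenever nvcc comes from the CUDA installation, would be to force that nvcc to the front of PATH after the modules are loaded, whatever order they were loaded in. This is only a sketch, not something we have adopted:
# after e.g. "module load cuda nvhpc", the problematic order:
export PATH="${CUDA_HOME}/bin:${PATH}"  # let the CUDA module's nvcc shadow the HPC SDK copy
hash -r                                 # forget any cached nvcc location in this shell
which nvcc                              # should now report ${CUDA_HOME}/bin/nvcc

The test script mentioned above is: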
# Try each module-loading configuration and attempt a trivial CUDA
# separate-compilation build with the resulting nvcc.
nvhpc_version=21.2
first_module=unstable
for config in nvhpc nvhpc_cuda cuda_nvhpc_unset_cuda_home cuda_nvhpc
do
  echo ${config}
  module purge
  if [[ ${config} == nvhpc ]]; then
    module load ${first_module} nvhpc/${nvhpc_version}
  elif [[ ${config} == nvhpc_cuda ]]; then
    module load ${first_module} nvhpc/${nvhpc_version} cuda
  elif [[ ${config} == cuda_nvhpc || ${config} == cuda_nvhpc_unset_cuda_home ]]; then
    module load ${first_module} cuda nvhpc/${nvhpc_version}
  fi
  if [[ ${config} == cuda_nvhpc_unset_cuda_home ]]; then
    unset CUDA_HOME
  fi
  # Report which nvcc is picked up and what it claims to be.
  echo which nvcc
  which nvcc
  echo nvcc --version
  nvcc --version
  echo "CUDA_HOME=${CUDA_HOME}"
  # Compile and device-link an empty translation unit; the link step is
  # what fails in the cuda_nvhpc configuration.
  echo '' > dummy.cu && nvcc dummy.cu -dc -o dummy.cu.o && nvcc dummy.cu.o -o dummy
  nvcc dummy.cu.o -o dummy -dryrun
done
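For reference, the script needs to run in a shell where the module command is defined, for example by sourcing it (file names here are arbitrary):
source ./module_order_test.sh 2>&1 | tee module_order_test.log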