Nvcc only partially respects CUDA_HOME ("Input file newer than toolkit")

We have a project that uses both OpenACC and native CUDA, so we use a build environment with the NVIDIA HPC compilers (here 21.2) and a version of CUDA (11.0) chosen for compatibility with other tools. This setup is fragile: whether it works depends on the order in which the nvhpc and cuda modules are loaded into the environment.

The error message is:

nvlink fatal : Input file '/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64/libcudadevrt.a:cuda_device_runtime.o' newer than toolkit (112 vs 110) (target: sm_60)

which happens when:

  • The nvcc binary comes from the nvhpc installation, not the cuda installation.
  • CUDA_HOME is set and points to the cuda installation.

In this situation nvcc reports that it will use CUDA 11.0:

which nvcc
/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/bin/nvcc

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

CUDA_HOME=/path/to/cuda-11.0.2-kb4wci

But a trivial example fails to compile with the above error:

echo '' > dummy.cu && nvcc dummy.cu -dc -o dummy.cu.o && nvcc dummy.cu.o -o dummy
nvlink fatal : Input file '/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64/libcudadevrt.a:cuda_device_runtime.o' newer than toolkit (112 vs 110)

This appears to be because nvcc derives its library search path from its own location rather than from CUDA_HOME. If I add -dryrun to the second nvcc invocation I see

#$ nvlink --arch=sm_52 --register-link-binaries="…" -m64 -L"/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda//lib64"…

and this

/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/lib64

directory is a symbolic link to

/path/to/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.2/lib64

which belongs to CUDA 11.2, not the 11.0 installation specified by CUDA_HOME and reported by nvcc --version.
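
To make the mismatch easy to see without a full compile, the library directory that nvcc hands to nvlink can be resolved and compared against CUDA_HOME. This is only a rough sketch, assuming the dummy.cu.o from above and the ../../cuda/lib64 layout of this particular nvhpc installation:

# Resolve the CUDA library directory bundled alongside the nvhpc-provided nvcc
nvcc_bin=$(which nvcc)
readlink -f "$(dirname "${nvcc_bin}")/../../cuda/lib64"

# Resolve the library directory of the toolkit that CUDA_HOME points to
readlink -f "${CUDA_HOME}/lib64"

# Ask nvcc which -L path it will pass to nvlink (the -dryrun output goes to stderr)
nvcc dummy.cu.o -o dummy -dryrun 2>&1 | grep nvlink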

If the environment is changed so that nvcc comes from the CUDA installation instead of the HPC SDK, then it seems to work. On our system this means that module load nvhpc cuda works, but module load cuda nvhpc does not.

For the sake of debugging, I note that both plain module load nvhpc, and module load cuda nvhpc followed by unset CUDA_HOME, also avoid the error, although presumably this is because CUDA 11.2 is then used throughout.

At the end of this post I have included a small test script, which might need a little adaptation. Only the cuda_nvhpc branch of the script gives the version mismatch error above.

What can we do to make this setup more robust? The behaviour of nvcc here, where --version reports one CUDA version but the link step uses the libraries of another, seems surprising.
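
For reference, one way to pin things down regardless of module load order would be to prepend CUDA_HOME/bin to PATH before building, so that the nvcc matching CUDA_HOME is found first. This is only a sketch of the workaround already described above, not a fix for the underlying behaviour:

# Workaround sketch: make the nvcc that matches CUDA_HOME win the PATH lookup,
# independent of the order in which the nvhpc and cuda modules were loaded.
if [[ -n "${CUDA_HOME}" && -x "${CUDA_HOME}/bin/nvcc" ]]; then
  export PATH="${CUDA_HOME}/bin:${PATH}"
fi
which nvcc   # should now report ${CUDA_HOME}/bin/nvcc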

nvhpc_version=21.2
first_module=unstable  # site-specific base module; adapt as needed
for config in nvhpc nvhpc_cuda cuda_nvhpc_unset_cuda_home cuda_nvhpc
do
  echo ${config}
  module purge
  if [[ ${config} == nvhpc ]];
  then
    module load ${first_module} nvhpc/${nvhpc_version}
  elif [[ ${config} == nvhpc_cuda ]];
  then
    module load ${first_module} nvhpc/${nvhpc_version} cuda
  elif [[ ${config} == cuda_nvhpc || ${config} == cuda_nvhpc_unset_cuda_home ]];
  then
    module load ${first_module} cuda nvhpc/${nvhpc_version}
  fi
  if [[ ${config} == cuda_nvhpc_unset_cuda_home ]];
  then
    unset CUDA_HOME
  fi
  echo which nvcc
  which nvcc
  echo nvcc --version
  nvcc --version
  echo "CUDA_HOME=${CUDA_HOME}"
  echo '' > dummy.cu && nvcc dummy.cu -dc -o dummy.cu.o && nvcc dummy.cu.o -o dummy
  nvcc dummy.cu.o -o dummy -dryrun
done