CUDA driver and runtime mismatch

I get this error when trying to run GROMACS:

NOTE: Detection of GPUs failed. The API reported:
      CUDA driver version is insufficient for CUDA runtime version
      GROMACS cannot run tasks on a GPU.

I am working on a shared machine without root access. This is the CUDA version I see in the system path:

$ ls /usr/local/cuda -l
lrwxrwxrwx. 1 root root 9 Oct 17  2018 /usr/local/cuda -> cuda-10.0

But I have my own version installed in my home directory:

$ echo $LD_LIBRARY_PATH
/storage/users/mahmood/cuda-10.1.168/lib64
$ which nvcc
~/cuda-10.1.168/bin/nvcc
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

The output of nvidia-smi is:

$ which nvidia-smi
/bin/nvidia-smi
$ nvidia-smi
Sat Apr 11 16:11:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   36C    P8    14W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  Off  | 00000000:81:00.0 Off |                  N/A |
| 75%   82C    P2   164W / 250W |    819MiB /  6083MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, deviceQuery works properly:

$ ~/NVIDIA_CUDA-10.1_Samples/1_Utilities/deviceQuery/deviceQuery
/storage/users/mahmood/NVIDIA_CUDA-10.1_Samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             5505 Mhz

And CUDA_VISIBLE_DEVICES is set correctly:

$ echo $CUDA_VISIBLE_DEVICES
0

The configure command was:

$ cmake .. -DCMAKE_INSTALL_PREFIX=/storage/users/mahmood/cactus/gromacs/gromacs-2019.4-1080ti/single -DGMX_GPU=on -DGMX_CUDA_TARGET_SM=61
...
-- Looking for NVIDIA GPUs present in the system
-- Number of NVIDIA GPUs detected: 2
-- Found CUDA: /storage/users/mahmood/cuda-10.1.168 (found suitable version "10.1", minimum required is "7.0")

So everything looks normal. Why is the binary unable to use device 0?

Any ideas for further debugging?

In the GROMACS log I see:

CUDA compiler:      /storage/users/mahmood/cuda-10.1.168/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Wed_Apr_24_19:10:27_PDT_2019;Cuda compilation tools, release 10.1, V10.1.168
CUDA compiler flags:-gencode;arch=compute_61,code=sm_61;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        10.0
CUDA runtime:       N/A

I wonder why the driver is reported as 10.0 and the runtime as N/A.

It seems that some global variable sets the driver version to 10.0, but I cannot find it. Any ideas?

Every CUDA version requires a minimum driver version. You appear to have installed CUDA 10.1, so any executable built with it will require at least that driver version. The minimum driver version on Linux for CUDA 10.1, per the list at the link below, is:

https://stackoverflow.com/questions/30820513/what-is-the-correct-version-of-cuda-for-my-nvidia-driver

CUDA 10.1: 418.39

Your currently installed driver is 410.48, which is older than that, so it won’t work. Either upgrade the driver to 418.39 or newer, or revert to an earlier version of CUDA (CUDA 10.0 should work with your driver, per the list).
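
If it helps, the comparison above can be sketched as a small shell check. This is only an illustration, not an official tool: the function names (min_driver_for_cuda, version_ge, check_driver) are made up for this example, the table covers just the two CUDA versions discussed in this thread (the 10.0 entry comes from NVIDIA's release notes and is worth double-checking against the linked list), and it assumes GNU sort with the -V option.

```shell
# Hypothetical helper: minimum Linux driver for a given CUDA toolkit version.
# Only the versions discussed in this thread are listed; extend from NVIDIA's table.
min_driver_for_cuda() {
    case "$1" in
        10.0) echo "410.48" ;;
        10.1) echo "418.39" ;;
        *)    return 1 ;;
    esac
}

# Succeeds when dotted version $1 is greater than or equal to $2.
# Assumes GNU sort, whose -V flag sorts version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare an installed driver ($2) with what a CUDA toolkit ($1) needs.
check_driver() {
    local cuda="$1" driver="$2" required
    required="$(min_driver_for_cuda "$cuda")" || {
        echo "UNKNOWN: no minimum driver recorded for CUDA $cuda"
        return 2
    }
    if version_ge "$driver" "$required"; then
        echo "OK: driver $driver satisfies CUDA $cuda (needs >= $required)"
    else
        echo "TOO OLD: driver $driver is below $required required by CUDA $cuda"
        return 1
    fi
}

# Values taken from this thread: driver 410.48 against CUDA 10.1 and 10.0.
check_driver 10.1 410.48 || true   # reports TOO OLD
check_driver 10.0 410.48           # reports OK
```

On a real machine you could pass the installed driver version instead of a hard-coded one, for example check_driver 10.1 "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)".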
