CUDA 10.2 on Linux: listing devices gives error 999

I am trying to update my CUDA installation from 10.0 (where everything worked well) to 10.2 on openSUSE Tumbleweed. The NVIDIA GPU is a secondary GPU, not driving the display. I use the official Tumbleweed RPMs for the NVIDIA GPU drivers.

I followed the official CUDA installation guide. Everything went well so far: I updated the NVIDIA drivers, installed CUDA 10.2 from the runfile (choosing not to install the driver from the CUDA installer, as I already have it from the RPMs), and nvidia-smi shows that everything is in order:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660     Off  | 00000000:02:00.0 N/A |                  N/A |
| 32%   45C    P0    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

I compiled the CUDA samples successfully as well (using g++-7 as the compiler; the default is g++-9), so nvcc works too. However, when I try to list the GPUs, the result is error 999:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

And the same error is given by anything that tries to interface with CUDA at runtime. Any ideas as to how to debug this issue?

I have the same issue with 18.04 LTS and CUDA 10.2:

~/$ nvidia-smi
Fri Apr 10 10:56:00 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   34C    P8    17W / 280W |    353MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   32C    P8    11W / 280W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   31C    P8    10W / 280W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   28C    P8    10W / 280W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                           
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1446      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1594      G   /usr/bin/gnome-shell                          49MiB |
|    0      1821      G   /usr/lib/xorg/Xorg                           140MiB |
|    0      1950      G   /usr/bin/gnome-shell                          99MiB |
|    0      2650      G   /opt/teamviewer/tv_bin/TeamViewer             13MiB |
|    0      3451      G   ...uest-channel-token=10071424817305171968    25MiB |
+-----------------------------------------------------------------------------+

Checking NVCC

~/$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Trying to run the CUDA sample

~/$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

No idea what this could be…

I am having the exact same problem with CUDA 10.2 on openSUSE Tumbleweed, using the latest NVIDIA driver, 440.82. Compiling the CUDA samples with gcc-7 worked. When I call ./deviceQuery I get the same error message.

However, I can run ./deviceQuery as root, and after doing this once it also works as a normal user. All of a sudden Blender finds my CUDA device too, and everything is fine until the next reboot. Then I have to run ./deviceQuery as root again to get CUDA to recognize my GPU.
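The thread does not confirm a cause, but one possible explanation for "works after root runs it once, until the next reboot" is that the nvidia_uvm kernel module and its /dev/nvidia-uvm device node are not set up at boot and only get created when a privileged process first initializes CUDA. That is an assumption, not a diagnosis. A minimal sketch for checking it as a normal user right after a reboot follows; the function name check_uvm and its two optional arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether the nvidia_uvm kernel module is loaded and whether
# its device node exists. Both arguments are optional and exist only so the
# function can be pointed at other locations (hypothetical helper, not part
# of any NVIDIA tool).
check_uvm() {
    local modules_file="${1:-/proc/modules}"
    local dev_dir="${2:-/dev}"

    # /proc/modules lists one loaded module per line, module name first.
    if grep -q '^nvidia_uvm ' "$modules_file" 2>/dev/null; then
        echo "module nvidia_uvm: loaded"
    else
        echo "module nvidia_uvm: not loaded"
    fi

    if [ -e "$dev_dir/nvidia-uvm" ]; then
        echo "node $dev_dir/nvidia-uvm: present"
    else
        echo "node $dev_dir/nvidia-uvm: missing"
    fi
}

check_uvm "$@"
```

If the module shows as not loaded before the root run and as loaded afterwards, that would at least be consistent with the guess above.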


This worked for me! Thank you very much. Any idea why this is happening? I’d rather not have to call ./deviceQuery after every reboot…

Hm, yes, indeed, running it as root returns device information correctly.

I set the permissions of the computer to “secure” in YaST, and now running it as non-root results in an error:

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL

The same is now true of nvidia-smi; as a regular user it gives me:

Failed to initialize NVML: Insufficient Permissions

Even if I add myself to the video group, it’s still the same. Running as root works. So it does seem to be a permissions problem…
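To narrow down a permissions problem like this, it may help to look at the device nodes themselves. The sketch below lists whatever matches /dev/nvidia* with mode, owner and group, and reports whether the current user can read and write each node. The function name check_node_access is made up for illustration, and `stat -c` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: list NVIDIA device nodes with mode, owner and group, and say
# whether the current user can open them for reading and writing.
# The directory argument is optional and defaults to /dev
# (hypothetical helper, not part of any NVIDIA tool).
check_node_access() {
    local dev_dir="${1:-/dev}"
    local node found=0

    for node in "$dev_dir"/nvidia*; do
        [ -e "$node" ] || continue   # the glob matched nothing
        found=1
        if [ -r "$node" ] && [ -w "$node" ]; then
            echo "$(stat -c '%A %U:%G' "$node") $node accessible"
        else
            echo "$(stat -c '%A %U:%G' "$node") $node NOT accessible as $(id -un)"
        fi
    done

    [ "$found" -eq 1 ] || echo "no nvidia device nodes in $dev_dir"
}

check_node_access "$@"
```

If the nodes turn out to be group-owned by video, note that a newly added group membership only applies to sessions started after the change; `id` shows the groups the current session actually has.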

I had the same problem, and running as root solved it just as you said! I think it started after I recently updated some NVIDIA libraries and software, but I can’t be sure.