Failed to run deviceQuery - CUDA 10.2, Tesla V100

Hello all,

I followed the instructions in the NVIDIA wizard and installed driver version 440 and CUDA 10.2 on CentOS 7.

The driver loads fine and detects the GPUs, but I can’t run any of the CUDA sample applications.

Running deviceQuery gives me:

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL

but nvidia-smi shows healthy output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:33:00.0 Off |                    0 |
| N/A   31C    P0    29W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   30C    P0    28W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:69:00.0 Off |                    0 |
| N/A   31C    P0    42W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:6C:00.0 Off |                    0 |
| N/A   28C    P0    29W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:DE:00.0 Off |                    0 |
| N/A   29C    P0    27W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:E1:00.0 Off |                    0 |
| N/A   28C    P0    28W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:F3:00.0 Off |                    0 |
| N/A   33C    P0    30W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:F6:00.0 Off |                    0 |
| N/A   30C    P0    31W / 350W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

All the device nodes exist:

$ ls /dev/nvidia*
/dev/nvidia0  /dev/nvidia2  /dev/nvidia4  /dev/nvidia6  /dev/nvidiactl       /dev/nvidia-uvm
/dev/nvidia1  /dev/nvidia3  /dev/nvidia5  /dev/nvidia7  /dev/nvidia-modeset  /dev/nvidia-uvm-tools

I tried a few suggestions from the "Installation Guide Linux" page of the CUDA Toolkit Documentation, but nothing helped.

Any suggestions are welcome.

thanks

Eventually I found my answer:

I had to install the NVIDIA Fabric Manager, probably to handle the NVLink stuff:
https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/getting-started.html
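
For anyone hitting the same error: 802 is cudaErrorSystemNotReady, which per the CUDA docs means a required driver daemon may not be running. Below is a rough sketch of how one might check whether the Fabric Manager service is up. The systemd unit name nvidia-fabricmanager is an assumption here (check what your package actually installs), and fm_state / fm_advice are just hypothetical helper names for illustration.

```shell
#!/bin/sh
# Sketch: report whether the Fabric Manager service appears to be running.
# ASSUMPTION: the systemd unit is called "nvidia-fabricmanager"; override
# FM_UNIT if your installation names it differently.
FM_UNIT="${FM_UNIT:-nvidia-fabricmanager}"

# Print the unit's state ("active", "inactive", ...), or "unknown" when
# systemctl is missing or gives no answer (e.g. inside a container).
fm_state() {
    state=""
    if command -v systemctl >/dev/null 2>&1; then
        state=$(systemctl is-active "$FM_UNIT" 2>/dev/null) || true
    fi
    echo "${state:-unknown}"
}

# Turn a state string into a short hint (hypothetical helper, string logic only).
fm_advice() {
    case "$1" in
        active)  echo "fabric manager is running" ;;
        unknown) echo "could not query systemd; check the service manually" ;;
        *)       echo "fabric manager is not running (state: $1); try: systemctl start $FM_UNIT" ;;
    esac
}

fm_advice "$(fm_state)"
```

If it reports the service as not running, starting it (as root) and re-running deviceQuery would be the next thing to try. This only checks the service state; it does not verify that the Fabric Manager version matches the driver version.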
