Missing GPU

Hi,

Hope you can help here. nvidia-smi is not recognizing one of the Tesla V100 GPUs. This is the output of the nvidia-smi command, showing only GPUs 0 to 6:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                  Off |
| N/A   33C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                  Off |
| N/A   35C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                  Off |
| N/A   48C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   33C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   35C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                  Off |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                  Off |
| N/A   36C    P0    43W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

If I query the GPUs with nvidia-smi -q, I see the one at PCI 3a:00.0 is missing:
[root@ng009 ~]# for i in 0 1 2 3 4 5 6 7; do nvidia-smi -i $i -q | grep "Bus Id"; done
Bus Id : 00000000:15:00.0
Bus Id : 00000000:16:00.0
Bus Id : 00000000:3B:00.0
Bus Id : 00000000:89:00.0
Bus Id : 00000000:8A:00.0
Bus Id : 00000000:B2:00.0
Bus Id : 00000000:B3:00.0

These are the GPUs installed:

15:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
16:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
8a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b2:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b3:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
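As a quick cross-check, the two lists can be diffed in the shell. This is only a sketch with the bus IDs above hard-coded as sample data (nvidia-smi's "00000000:3B:00.0" form normalized to lspci's lowercase, domain-less form); on the live system they could instead come from `lspci -d 10de:` and `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`:

```shell
# Sketch: compare the bus IDs lspci reports against the ones the
# driver enumerates. Both lists are hard-coded from the output above.
lspci_ids="15:00.0 16:00.0 3a:00.0 3b:00.0 89:00.0 8a:00.0 b2:00.0 b3:00.0"
smi_ids="15:00.0 16:00.0 3b:00.0 89:00.0 8a:00.0 b2:00.0 b3:00.0"

for id in $lspci_ids; do
  case " $smi_ids " in
    *" $id "*) ;;                                # driver sees this GPU
    *) echo "missing from nvidia-smi: $id" ;;    # lspci-only device
  esac
done
```

With the lists above, this prints only the device at 3a:00.0, the one the driver failed to bring up.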

All jobs I run on the GPUs make the server hang.
Please find attached a bug report.

Thanks in advance
Ruben
nvidia-bug-report.log.gz (84 KB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Attached you can find the .gz file.

br/
Ruben
nvidia-bug-report.log.gz (84 KB)

Looks broken:

[  136.302383] nvidia 0000:3a:00.0: irq 493 for MSI/MSI-X
[  136.647391] NVRM: RmInitAdapter failed! (0x26:0xffff:1125)
[  136.647406] NVRM: rm_init_adapter failed for device bearing minor number 2
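Failures like this can be pulled out of the log mechanically; the sketch below greps for the init-adapter errors, using the quoted lines as stand-in input (on the live machine you would pipe `dmesg` or the kernel log from the bug report instead):

```shell
# Sketch: count NVRM init-adapter failures in a kernel log.
# The sample input is the dmesg excerpt quoted above.
log='[  136.302383] nvidia 0000:3a:00.0: irq 493 for MSI/MSI-X
[  136.647391] NVRM: RmInitAdapter failed! (0x26:0xffff:1125)
[  136.647406] NVRM: rm_init_adapter failed for device bearing minor number 2'

# -E: extended regex, -c: count lines, -i: catch both spellings
nvrm_errors=$(printf '%s\n' "$log" | grep -Eci 'rm_?init_?adapter failed')
echo "init-adapter failures logged: $nvrm_errors"
# prints: init-adapter failures logged: 2
```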

Thanks! I'll have it checked and replaced.