Hi,
I hope you can help here: nvidia-smi is not recognizing one of the eight Tesla V100 GPUs in this server. Below is the output of nvidia-smi, which shows only GPUs 0 to 6:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                  Off |
| N/A   33C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                  Off |
| N/A   35C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                  Off |
| N/A   48C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   33C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   35C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                  Off |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                  Off |
| N/A   36C    P0    43W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
If I query the GPUs with nvidia-smi -q, I see the one at PCI 3a:00.0 is missing:
[root@ng009 ~]# for i in 0 1 2 3 4 5 6 7; do nvidia-smi -i $i -q | grep "Bus Id"; done
Bus Id : 00000000:15:00.0
Bus Id : 00000000:16:00.0
Bus Id : 00000000:3B:00.0
Bus Id : 00000000:89:00.0
Bus Id : 00000000:8A:00.0
Bus Id : 00000000:B2:00.0
Bus Id : 00000000:B3:00.0
These are the GPUs installed, according to lspci:
15:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
16:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
8a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b2:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b3:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
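
A quick way to cross-check the two lists (a rough sketch; 10de is the NVIDIA PCI vendor ID, and the case-insensitive match is because lspci prints lowercase hex while nvidia-smi prints uppercase):

# For every NVIDIA device the kernel enumerates, check whether
# nvidia-smi reports a matching bus ID.
smi_output=$(nvidia-smi -q)
for bus in $(lspci -d 10de: | awk '{print $1}'); do
    if echo "$smi_output" | grep -Fqi "$bus"; then
        echo "$bus: visible to the driver"
    else
        echo "$bus: MISSING from the driver"
    fi
done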
Every job I run on the GPUs makes the server hang.
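In case it helps, I can also pull the NVIDIA kernel-module messages from the kernel log around the time of a hang (a generic check, nothing specific to this box):

# Look for NVIDIA kernel-module (NVRM) messages and Xid error codes.
dmesg | grep -iE "nvrm|xid"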
Please find a bug report attached.
Thanks in advance
Ruben
nvidia-bug-report.log.gz (84 KB)