Hi,
I hope you can help here.
Occasionally, nvidia-smi fails to recognize one or two of the V100-PCIE-32GB GPUs. When I reboot 30 GPU servers (each server has 8 V100-PCIE-32GB GPUs), 7 or 8 of them always come up missing 1 or 2 GPUs, and the affected servers are not the same from one reboot to the next. Below is the output of lspci and nvidia-smi on one affected server: lspci lists all 8 GPUs, but nvidia-smi -L lists only 7:
[root@bj02-compute-10e129e213e32 han]# lspci | grep -i nvidia
5a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
62:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
66:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
b5:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
b9:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
bd:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
[root@bj02-compute-10e129e213e32 han]# nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-6c7c45ba-13f1-c09a-fa37-0329fbe03801)
GPU 1: Tesla V100-PCIE-32GB (UUID: GPU-a670b915-9dd1-9fa3-8b42-50ff3e0ad737)
GPU 2: Tesla V100-PCIE-32GB (UUID: GPU-305e70cb-47eb-00e2-562b-d843e100ef93)
GPU 3: Tesla V100-PCIE-32GB (UUID: GPU-3d05e857-0baf-bb81-7b0c-43cc4736bf7c)
GPU 4: Tesla V100-PCIE-32GB (UUID: GPU-696ceee1-2874-9800-c692-b65ae864c95f)
GPU 5: Tesla V100-PCIE-32GB (UUID: GPU-d41b178a-b5a2-ecf0-8707-97df5f752a6a)
GPU 6: Tesla V100-PCIE-32GB (UUID: GPU-eb99e23f-d87a-0c15-7c49-ac5edd6167f2)
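Since the affected servers change on every reboot, the comparison above could be automated on each host. The following is only a minimal sketch, assuming Python 3.7+ is available on the servers and that `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` prints one PCI bus ID per visible GPU. The function names (`pci_addresses`, `missing_gpus`, `check_local_host`) are hypothetical and not part of any NVIDIA tool.

```python
import re
import subprocess

# Matches a PCI address at the start of a line, with or without a domain
# prefix, e.g. "5a:00.0" (lspci) or "00000000:5A:00.0" (nvidia-smi).
_PCI_RE = re.compile(
    r"^(?:[0-9a-fA-F]{4,8}:)?([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])"
)


def pci_addresses(text):
    """Return the set of lowercase bus:device.function addresses found in text."""
    found = set()
    for line in text.splitlines():
        match = _PCI_RE.match(line.strip())
        if match:
            found.add(match.group(1).lower())
    return found


def missing_gpus(lspci_text, smi_text):
    """Return NVIDIA devices that lspci lists but nvidia-smi does not."""
    nvidia_lines = "\n".join(
        line for line in lspci_text.splitlines() if "nvidia" in line.lower()
    )
    return sorted(pci_addresses(nvidia_lines) - pci_addresses(smi_text))


def run(cmd):
    """Run a command and return its stdout (empty string if it prints nothing)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def check_local_host():
    """Compare lspci and nvidia-smi on this host (assumes both are in PATH)."""
    lspci_text = run(["lspci"])
    smi_text = run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"]
    )
    return missing_gpus(lspci_text, smi_text)
```

Calling `check_local_host()` after each reboot returns the addresses of GPUs that are on the PCI bus but were not initialized by the driver, which makes it easier to see whether particular slots fail more often than others.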
/var/log/messages also contains the following entries related to this error:
nvidia 0000:5e:00.0: irq 447 for MSI/MSI-X
NVRM: GPU at 0000:5e:00.0 has software scheduler ENABLED with policy PGPU_SHARE.
NVRM: RmInitAdapter failed! (0x26:0x65:1106)
NVRM: rm_init_adapter failed for device bearing minor number 1
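To see whether the same PCI addresses fail repeatedly, log entries like the ones above could be tallied across servers. This is a rough sketch under one assumption that should be checked: each `RmInitAdapter failed!` line is attributed to the most recent preceding `NVRM: GPU at <address>` line, which holds for the excerpt above but is a heuristic, not documented driver behavior. The names `init_failures` and `count_by_address` are hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the NVRM lines quoted above.
GPU_AT_RE = re.compile(r"NVRM: GPU at ([0-9a-fA-F:.]+) ")
INIT_FAIL_RE = re.compile(r"NVRM: RmInitAdapter failed! \(([^)]+)\)")


def init_failures(log_text):
    """Return (pci_address, error_code) pairs for each RmInitAdapter failure.

    Assumption: a failure belongs to the last "NVRM: GPU at ..." line seen
    before it. The address is None if no such line preceded the failure.
    """
    failures = []
    last_addr = None
    for line in log_text.splitlines():
        match = GPU_AT_RE.search(line)
        if match:
            last_addr = match.group(1)
            continue
        match = INIT_FAIL_RE.search(line)
        if match:
            failures.append((last_addr, match.group(1)))
    return failures


def count_by_address(failures):
    """Count how many failures were attributed to each PCI address."""
    return Counter(addr for addr, _code in failures)
```

Feeding the contents of /var/log/messages from every server after each reboot into `init_failures` and summing the counters would show whether the failures cluster on specific slots or PCIe switches.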
Here is the part of the lspci tree that contains the GPUs:
[root@bj02-compute-10e129e213e12 ~]# lspci -t | grep -B2 -A4 5a
| \-08.2
+-[0000:53]-+-00.0-[54-69]----00.0-[55-69]--+-04.0-[56-59]--
| | +-08.0-[5a-5d]----00.0
| | +-0c.0-[5e-61]----00.0
| | +-10.0-[62-65]----00.0
| | \-14.0-[66-69]----00.0
| +-05.0
[root@bj02-compute-10e129e213e12 ~]# lspci -t | grep -B2 -A4 b5
+-[0000:ae]-+-00.0-[af-c4]----00.0-[b0-c4]--+-04.0-[b1-b4]--+-00.0
| | | \-00.1
| | +-08.0-[b5-b8]----00.0
| | +-0c.0-[b9-bc]----00.0
| | +-10.0-[bd-c0]----00.0
| | \-14.0-[c1-c4]----00.0
| +-05.0
Thanks in advance.
Hanhaijiao
nvidia-bug-report.log.gz (3.09 MB)