Occasionally missing V100-PCIE-32GB GPU

Hi,
I hope you can help.
Occasionally, nvidia-smi fails to recognize one or two of the V100-PCIE-32GB GPUs. When I reboot 30 GPU servers (each with 8 V100-PCIE-32GB GPUs), 7 or 8 of them always come up missing 1 or 2 GPUs, and the affected servers differ from one reboot to the next. Below is the output of lspci and nvidia-smi on one affected server; lspci lists all 8 GPUs, but nvidia-smi lists only 7:

[root@bj02-compute-10e129e213e32 han]# lspci | grep -i nvidia
5a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
62:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
66:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
b5:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
b9:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
bd:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
[root@bj02-compute-10e129e213e32 han]# nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-6c7c45ba-13f1-c09a-fa37-0329fbe03801)
GPU 1: Tesla V100-PCIE-32GB (UUID: GPU-a670b915-9dd1-9fa3-8b42-50ff3e0ad737)
GPU 2: Tesla V100-PCIE-32GB (UUID: GPU-305e70cb-47eb-00e2-562b-d843e100ef93)
GPU 3: Tesla V100-PCIE-32GB (UUID: GPU-3d05e857-0baf-bb81-7b0c-43cc4736bf7c)
GPU 4: Tesla V100-PCIE-32GB (UUID: GPU-696ceee1-2874-9800-c692-b65ae864c95f)
GPU 5: Tesla V100-PCIE-32GB (UUID: GPU-d41b178a-b5a2-ecf0-8707-97df5f752a6a)
GPU 6: Tesla V100-PCIE-32GB (UUID: GPU-eb99e23f-d87a-0c15-7c49-ac5edd6167f2)

There are also some related messages in /var/log/messages:

nvidia 0000:5e:00.0: irq 447 for MSI/MSI-X
NVRM: GPU at 0000:5e:00.0 has software scheduler ENABLED with policy PGPU_SHARE.
NVRM: RmInitAdapter failed! (0x26:0x65:1106)
NVRM: rm_init_adapter failed for device bearing minor number 1
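
To see how often this happens on each server, the NVRM lines above can be counted per error code and per device minor number. This is a rough sketch that only recognises the two message formats quoted in this post; `summarize_init_failures` is a hypothetical helper name, and the path /var/log/messages is taken from the post.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this post.
INIT_FAIL = re.compile(r"NVRM: RmInitAdapter failed! \(([^)]*)\)")
MINOR_FAIL = re.compile(r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)")


def summarize_init_failures(lines):
    """Count RmInitAdapter error codes and the minor numbers of failing devices."""
    codes = Counter()
    minors = Counter()
    for line in lines:
        m = INIT_FAIL.search(line)
        if m:
            codes[m.group(1)] += 1
        m = MINOR_FAIL.search(line)
        if m:
            minors[int(m.group(1))] += 1
    return codes, minors


def summarize_log_file(path="/var/log/messages"):
    """Convenience wrapper that reads a log file (usually needs root)."""
    with open(path, errors="replace") as fh:
        return summarize_init_failures(fh)
```

If the same minor number or error code shows up on every affected server, that is worth passing on to the vendor along with the bug report.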

Here is the lspci tree around the GPUs:

[root@bj02-compute-10e129e213e12 ~]# lspci -t | grep -B2 -A4 5a
 |           \-08.2
 +-[0000:53]-+-00.0-[54-69]----00.0-[55-69]--+-04.0-[56-59]--
 |           |                               +-08.0-[5a-5d]----00.0
 |           |                               +-0c.0-[5e-61]----00.0
 |           |                               +-10.0-[62-65]----00.0
 |           |                               \-14.0-[66-69]----00.0
 |           +-05.0
[root@bj02-compute-10e129e213e12 ~]# lspci -t | grep -B2 -A4 b5
 +-[0000:ae]-+-00.0-[af-c4]----00.0-[b0-c4]--+-04.0-[b1-b4]--+-00.0
 |           |                               |               \-00.1
 |           |                               +-08.0-[b5-b8]----00.0
 |           |                               +-0c.0-[b9-bc]----00.0
 |           |                               +-10.0-[bd-c0]----00.0
 |           |                               \-14.0-[c1-c4]----00.0
 |           +-05.0

Thanks in advance.
Hanhaijiao
nvidia-bug-report.log.gz (3.09 MB)

That RmInitAdapter failed message would normally point towards defective GPUs, but since this seems to happen randomly on different servers (did you check whether it's always the same PCI bus that is affected?), it rather points to some PCIe/power instability. Did you already check for a BIOS update? Otherwise, you should maybe contact Huawei, since those servers are built to run the GPUs reliably.
BTW, do you reboot all 30 servers at the same time? If so, does rebooting them one by one lead to the same issue? You should take instabilities of the external power supply into consideration.

Yes, we have rebooted them one by one, waiting 60 seconds before starting the next server, and that leads to the same issue.

Today we enabled ECC and did not see the issue. We used vGPU 8.0 for this test.

Hi, the server vendor has been looking into this issue over the past few days. Huawei found the following VBIOS version:

# nvidia-smi -q | grep -i bios
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02
    VBIOS Version                   : 88.00.48.00.02

Do you know what the newest VBIOS version is?
Thanks in advance.

AFAIK, Teslas are all manufactured by PNY, and they rarely if ever publish VBIOS updates; server manufacturers like Huawei should be the first to know. So asking you that is a dud.

Hi, after updating the VBIOS from 88.00.48.00.02 to 88.00.7e.00.03, the problem disappeared. Thank you very much.
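
For anyone rolling the same update out across many servers, here is a small sketch for confirming that every GPU reports the expected VBIOS. It parses the `VBIOS Version` lines printed by `nvidia-smi -q`, as shown earlier in this thread. The helper names are made up, and the default `expected` value is simply the version that resolved the issue here, not necessarily the newest one.

```python
import re
from collections import Counter

# Matches lines such as 'VBIOS Version                   : 88.00.48.00.02'.
VBIOS_LINE = re.compile(r"VBIOS Version\s*:\s*(\S+)")


def vbios_versions(smi_text):
    """Count how many GPUs report each VBIOS version (lower-cased)."""
    return Counter(m.group(1).lower() for m in VBIOS_LINE.finditer(smi_text))


def outdated_versions(smi_text, expected="88.00.7e.00.03"):
    """Return {version: gpu_count} for every version other than `expected`."""
    counts = vbios_versions(smi_text)
    return {v: n for v, n in counts.items() if v != expected.lower()}
```

Feeding it the output of `nvidia-smi -q` from each server gives an empty dict once all GPUs on that server are on the expected version.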