Hi,
We have a DGX-1 server with 8 V100 GPUs. Recently the system shut down unexpectedly, and in the BMC and SEL logs we found a GPU_Overtemp event, after which the system was powered down.
After powering the system on again, the GPUs are no longer visible on the PCIe bus; lspci does not show them.
Extract of the SEL:
[Information] [Power Unit] [Power Unit] Power Off / Power Down - Asserted
[Critical] [Critical IRQ] [Critical Interrupt] Software NMI - Asserted
[Critical] [HSC2_Alert] [Power Unit] State Asserted - Asserted
[Critical] [Critical IRQ] [Critical Interrupt] Software NMI - Asserted
[Critical] [Critical IRQ] [Critical Interrupt] Software NMI - Asserted
[Critical] [OEM Record c0] [001c4c] ManufacturerID:001C4C/ VID:8086/ DID:6F04/ ErrorID 1:21/ ErrorID 2:24
[Critical] [PCIE Error] [Critical Interrupt] Bus Fatal (Bus0/Dev2/Fun0) - Asserted
[Critical] [GPU_Overtemp] [Temperature] State Asserted (0x00) - Asserted
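In case it helps anyone who wants to sift through a longer SEL dump, here is a minimal Python sketch for splitting entries like the ones above into fields. It assumes each line follows the layout seen in this extract (severity, sensor name, sensor type in square brackets, then free event text); the field names and the functions parse_sel_line and summarize are my own illustrative choices, not part of any BMC tool.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the extract above:
#   [Severity] [Sensor name] [Sensor type] free-form event text
SEL_LINE = re.compile(
    r"^\[(?P<severity>[^\]]+)\]\s+"
    r"\[(?P<sensor>[^\]]+)\]\s+"
    r"\[(?P<sensor_type>[^\]]+)\]\s+"
    r"(?P<event>.*)$"
)

def parse_sel_line(line):
    """Return a dict of fields for one SEL line, or None if it does not match."""
    match = SEL_LINE.match(line.strip())
    return match.groupdict() if match else None

def summarize(lines):
    """Count entries per (severity, sensor) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_sel_line(line)
        if entry is not None:
            counts[(entry["severity"], entry["sensor"])] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "[Critical] [PCIE Error] [Critical Interrupt] Bus Fatal (Bus0/Dev2/Fun0) - Asserted",
        "[Critical] [GPU_Overtemp] [Temperature] State Asserted (0x00) - Asserted",
    ]
    for key, count in summarize(sample).items():
        print(key, count)
```

Grouping by severity and sensor makes it easier to see whether the overtemp event stands alone or is surrounded by repeated power or PCIe events.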
Since DGX-1 systems are out of support, we can no longer ask Nvidia support for help with this, so I am hoping someone on this forum has a suggestion. It is quite possible that the GPU tray has a hardware defect. We did a visual inspection of the GPU tray and found nothing obvious. The system's syslog does not show any PCIe errors, which I would expect to see if there were a defect.
Does anyone have suggestions for things to try or things to reset? Also, is anyone aware of common defects that cause GPUs to no longer show up on the PCIe bus on a DGX-1 system?