We are facing a critical issue with the GPUs on our DGX-1. None of the GPUs appear in the output of lspci | grep nvidia, so the OS cannot access them.
OS: Ubuntu 16.04.3 LTS
DGX_SWBUILD_VERSION="3.1.2"
The issue appeared after a GPU_Overtemp error caused the machine to shut down. When we restarted it, the GPUs were no longer accessible. Can someone suggest a fix? These are the entries in the SEL log:
GPU_Overtemp | Temperature | State Asserted
PCIE Error | Critical Interrupt | Bus Fatal Error ; OEM Event Data2 code = 10h ; OEM Event Data3 code = 80h
N/A | N/A | OEM defined = 86h 80h 04h 6Fh 21h 24h
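
One caveat on the lspci check: lspci normally prints the vendor as "NVIDIA Corporation", so a case-sensitive grep for nvidia can come back empty even when the devices are enumerated. Below is a minimal sketch of a case-insensitive check, assuming lspci from pciutils is installed. The helper name count_nvidia is made up for illustration and is not part of any DGX tooling.

```shell
#!/bin/sh
# count_nvidia (hypothetical helper): reads lspci-style text on stdin and
# prints how many lines mention NVIDIA, ignoring case.
count_nvidia() {
    grep -ic 'nvidia'
}

# Typical use on the affected machine:
#   lspci | count_nvidia
# Alternative that does not depend on the vendor string at all,
# filtering by NVIDIA's PCI vendor ID (10de):
#   lspci -d 10de:
```

If both checks still report no devices, the GPUs are not being enumerated on the PCIe bus at all, which would be consistent with the Bus Fatal Error entry in the SEL log.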