Driver Version: 580.105.08
Problem Description:
During normal operation, GPU3 (0000:c2:00.0) suddenly failed with the error:
Unable to determine the device handle for GPU3: 0000:C2:00.0: Unknown Error
dmesg revealed the critical error:
NVRM: GPU 0000:c2:00.0: GPU has fallen off the bus.
Troubleshooting Steps Attempted:
- PCI Check: `lspci` showed all GPUs visible, with `rev a1` (not `rev ff`).
- GPU Reset: Both methods failed:

      sudo nvidia-smi --gpu-reset -i 3                    # Failed
      echo 1 > /sys/bus/pci/devices/0000:c2:00.0/reset    # Failed

- Graceful Shutdown: `sudo shutdown -h now` hung indefinitely, likely due to the NVIDIA driver waiting for the unresponsive GPU.
- Forced Reboot: Successfully recovered using SysRq:

      sudo sync && sleep 2 && echo b | sudo tee /proc/sysrq-trigger

  After reboot, all GPUs are now visible and functional.
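For anyone who wants to repeat the PCI and kernel-log checks above the next time this happens, here is a minimal sketch. The function names (`pci_rev`, `pci_status`, `gpu_log_events`) are hypothetical helpers, not part of any NVIDIA tool. It assumes `lspci` prints the revision as `(rev xx)` and that a device which has dropped off the bus reads back as `rev ff`.

```bash
#!/bin/sh
# Hypothetical helpers for the checks described above; names are illustrative.

# Print the two-digit PCI revision found in one line of lspci output
# (for example "a1"), or nothing if the line carries no "(rev xx)" field.
pci_rev() {
    printf '%s\n' "$1" | sed -n 's/.*(rev \([0-9a-fA-F]\{2\}\)).*/\1/p'
}

# Classify one lspci line:
#   dropped    - revision reads ff (config space returns all ones)
#   responding - any other revision
#   unknown    - no revision field found
pci_status() {
    case "$(pci_rev "$1")" in
        ff|FF|fF|Ff) echo dropped ;;
        "")          echo unknown ;;
        *)           echo responding ;;
    esac
}

# Filter kernel log text on stdin, keeping the NVRM bus-drop message
# and any NVRM Xid lines. Exits 0 even when nothing matches.
gpu_log_events() {
    grep -E 'NVRM: (GPU .*fallen off the bus|Xid)' || true
}
```

Possible usage on the affected machine, assuming the same bus address as in this report: `pci_status "$(lspci -s c2:00.0)"` and `sudo dmesg | gpu_log_events`. Saving the second output before rebooting keeps the timestamps of the failure available for later comparison.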
Current Status:
- System is back online and all 4 GPUs are detected.
- Question: How can I diagnose the root cause and prevent recurrence?
nvidia-bug-report.log.gz (1.7 MB)