[Multi-GPU] 4x RTX 3090 "GPU has fallen off the bus" - Reset Failed, Shutdown Hung, Recovered via SysRq

Driver Version: 580.105.08

Problem Description:

During normal operation, GPU3 (0000:c2:00.0) suddenly failed with the error:

Unable to determine the device handle for GPU3: 0000:C2:00.0: Unknown Error

dmesg revealed the critical error:

NVRM: GPU 0000:c2:00.0: GPU has fallen off the bus.

Troubleshooting Steps Attempted:

  1. PCI Check: lspci showed all GPUs visible, with rev a1 (not rev ff).

  2. GPU Reset: Both methods failed:

    sudo nvidia-smi --gpu-reset -i 3  # Failed
    echo 1 > /sys/bus/pci/devices/0000:c2:00.0/reset  # Failed
    
  3. Graceful Shutdown: sudo shutdown -h now hung indefinitely, likely due to the NVIDIA driver waiting for the unresponsive GPU.

  4. Forced Reboot: Successfully recovered using SysRq:

    sudo sync && sleep 2 && echo b | sudo tee /proc/sysrq-trigger
    

    After reboot, all GPUs are now visible and functional.
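
    Since a forced reboot clears the live dmesg buffer, it may help to pull the driver fault lines out of the previous boot's kernel log. A minimal sketch, assuming journald keeps persistent logs on this machine; the function name nvrm_faults is made up for illustration, and the pattern only covers Xid events and the "fallen off the bus" message seen above:

```shell
#!/usr/bin/env bash
# Filter kernel log text (read from stdin) down to NVIDIA driver fault lines:
# Xid events and "GPU has fallen off the bus" messages.
# nvrm_faults is a hypothetical helper name, not part of any NVIDIA tool.
nvrm_faults() {
    grep -E 'NVRM: (Xid|GPU .*fallen off the bus)'
}

# Typical use (usually needs root):
#   sudo dmesg -T | nvrm_faults                # current boot
#   sudo journalctl -k -b -1 | nvrm_faults     # previous boot, if journald is persistent
```

    Lines that appear shortly before the failure (for example Xid 79, which is the code the driver logs for a GPU falling off the bus) are probably the most useful ones to post along with the bug report.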

Current Status:

  • System is back online, all 4 GPUs detected

  • Question: How to diagnose the root cause and prevent recurrence?

    nvidia-bug-report.log.gz (1.7 MB)

Welcome @amcsyihonglin to the NVIDIA developer forums.

I moved your post to the dedicated Linux category, where people are more active on issues like yours.

Because the failure is intermittent, there are many possible causes: faulty hardware, temperature issues (four RTX 3090s produce a lot of heat), or power delivery. I would rule those out first, and address them if one turns out to be the cause.

Thanks!

Hi, the GPU cooling components were recently replaced and the server’s power supply is 2400 W, but the error still occurs.

Check this post: Maximum power draw 3090

The official TDP of a 3090 is 350 W if I remember correctly, so your PSU should be OK, but it would not hurt to monitor peak power draw and see if there are any spikes. Also check whether the GPUs are well distributed across the PSU's PCIe power rails.
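
One rough way to watch for spikes from software is to poll nvidia-smi and keep only the samples above a limit. This is just a sketch: the helper name power_spikes and the 350 W default are assumptions for illustration, and the driver-reported power.draw is a polled reading, so it will likely miss very short transients that only a hardware power meter would catch.

```shell
#!/usr/bin/env bash
# Read "timestamp, index, power.draw, temperature.gpu" CSV rows on stdin
# (nvidia-smi csv,noheader,nounits format) and print only the rows whose
# power draw exceeds the limit in watts (first argument, default 350).
# power_spikes is a hypothetical helper name.
power_spikes() {
    awk -F', *' -v limit="${1:-350}" '$3 + 0 > limit + 0'
}

# Typical use: sample all GPUs every 100 ms and log rows above 350 W.
#   nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu \
#       --format=csv,noheader,nounits -lms 100 | power_spikes 350
```

If GPU3 shows up noticeably more often than the others, or the last logged samples before a failure are near the limit, that would point towards power delivery rather than the driver.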

Thanks for your suggestion, I will try it.