GPU has fallen off the bus

I would like to know why my GPUs die.
They drop out during deep learning training.
Once this happens, I have to reboot to recover.

The server specifications are as follows.
CPU : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Motherboard : X11DPX-T
GPU : 8x NVIDIA RTX 3090 24GB, water cooled


nvidia-bug-report.log.gz (1.5 MB)

As always:

  • check/replace your PSU
  • check GPU fans
  • reseat your GPU(s)
  • test under Windows if possible
  • remove overclocking (if you’re using it)

The RTX 30 series can have very large transient spikes in power consumption, so a PSU with plenty of headroom is a must.
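To put rough numbers on that for an 8-GPU box, here is a back-of-the-envelope sketch. The 350 W figure is the reference RTX 3090 board power (partner cards may be set higher). The 400 W allowance for CPUs, pumps, drives and fans and the 1.5x spike factor are assumptions for illustration only, not measurements from this system.

```python
def peak_power_estimate(gpu_count, gpu_board_watts, other_watts, spike_factor):
    """Return (sustained, transient) system power estimates in watts."""
    gpu_total = gpu_count * gpu_board_watts
    sustained = gpu_total + other_watts
    # Transient estimate: scale only the GPU share by the assumed spike factor.
    transient = gpu_total * spike_factor + other_watts
    return sustained, transient


# 8 GPUs as in the original post; 350 W is the reference RTX 3090 board power.
# other_watts=400 and spike_factor=1.5 are assumed values, for illustration only.
sustained, transient = peak_power_estimate(8, 350, 400, 1.5)
print(f"sustained: {sustained:.0f} W, transient: {transient:.0f} W")
```

If the combined PSU capacity is close to the sustained figure, short spikes could plausibly trip protection and take a GPU offline, so it is worth comparing these estimates against the actual PSU ratings.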

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Xid 79 can have multiple causes, including hardware-related errors.

So it is worth checking those first; the problem might be caused by faulty hardware.
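To see which GPU reports the Xid and whether it is always the same one, the kernel log can be scanned for NVRM lines. This is a minimal sketch that assumes the usual `NVRM: Xid (PCI:...): <number>, ...` message layout. The sample lines are made up for illustration and are not taken from the attached bug report.

```python
import re

# Matches the usual NVRM kernel log form, e.g.
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xids(lines):
    """Return a list of (pci_address, xid_number) found in kernel log lines."""
    hits = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits


# Illustrative sample lines (hypothetical, not from the real log).
sample = [
    "[ 1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[ 1234.6] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]
print(find_xids(sample))
```

In practice the lines would come from `dmesg` or `journalctl -k` output saved to a file. If the same PCI address shows up every time, that card, its slot, or its power cabling is the first suspect; if the address moves around, a shared cause such as the PSU or BIOS settings looks more likely.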

A GPU falling off the bus is often a BIOS, power supply, or thermal issue. Could you please update the BIOS and also verify that the system has an adequate power supply and no thermal issues?
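For the power and thermal check, one option is to log temperature and power draw per GPU in the run-up to a failure. The sketch below wraps `nvidia-smi --query-gpu` with CSV output. The helper names (`parse_smi_csv`, `read_gpus`) are hypothetical, and the polling interval and log destination are left to you.

```python
import subprocess

# Fields accepted by `nvidia-smi --query-gpu`.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def _num(field):
    """Convert a numeric field; bracketed text such as [N/A] becomes None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_smi_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, temp, draw, limit = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "temp_c": _num(temp),
            "power_w": _num(draw),
            "limit_w": _num(limit),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows (raises if it fails)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

Calling `read_gpus()` every few seconds during training and appending the rows to a file should show whether a GPU was running hot or sitting at its power limit just before it dropped. Once a GPU has fallen off the bus, `nvidia-smi` itself may fail, so the useful data is whatever was logged up to that point.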