GPU has fallen off the bus

I would like to know why the GPUs die.
During deep learning training, the GPUs die.
When the problem occurs, I need to reboot to recover.

The server specifications are as follows.
CPU : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Motherboard : X11DPX-T
GPU : NVIDIA RTX 3090 24GB, water cooled, x 8


nvidia-bug-report.log.gz (1.5 MB)

As always:

  • check/replace your PSU
  • check GPU fans
  • reseat your GPU(s)
  • test under Windows if possible
  • remove overclocking (if you’re using it)

The RTX 30 series can have very large spikes in power consumption, so a beefy PSU is a must.
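To put a rough number on "beefy", here is a back-of-the-envelope sketch for the eight cards listed in the first post. The 350 W figure is the reference RTX 3090 board power and is an assumption for these water-cooled cards (the actual limit can be read with `nvidia-smi -q -d POWER`). The spike factor is a purely hypothetical placeholder, not a measured value.

```python
def sustained_gpu_watts(gpu_count, board_power_watts):
    """Total GPU draw if every card sits at its board power limit."""
    return gpu_count * board_power_watts


def transient_gpu_watts(gpu_count, board_power_watts, spike_factor):
    """Worst case if every card spikes at the same moment."""
    return gpu_count * board_power_watts * spike_factor


GPU_COUNT = 8          # from the server spec in the first post
BOARD_POWER_W = 350    # assumption: reference RTX 3090 board power
SPIKE_FACTOR = 1.5     # hypothetical placeholder, not a measured value

print("sustained GPU load (W):", sustained_gpu_watts(GPU_COUNT, BOARD_POWER_W))
print("transient GPU load (W):",
      transient_gpu_watts(GPU_COUNT, BOARD_POWER_W, SPIKE_FACTOR))
```

This covers the GPUs only; CPUs, drives, pumps and fans come on top of it.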

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Xid 79 can occur for multiple reasons, including hardware-related errors.

So better check those; it might be caused by faulty hardware.
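To see which card reported the Xid, and whether it is always the same one, the kernel log can be scanned for the driver's "NVRM: Xid" lines. A minimal sketch, assuming the usual line layout; the sample line below is illustrative only and is not taken from the attached bug report.

```python
import re

# Matches the "NVRM: Xid (<pci device>): <code>, ..." lines in the kernel log.
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return a list of (pci_device, xid_code) for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


# Illustrative sample input (hypothetical PCI address and pid).
sample = [
    "[ 1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[ 1234.6] some unrelated kernel message",
]
print(find_xid_events(sample))
```

In practice the input would be the lines from `dmesg` or `journalctl -k`. If the same PCI address shows up every time, that points at one card, riser or slot rather than the whole system.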

A GPU falling off the bus is often a BIOS, power supply, or thermal issue. Can you please update the BIOS and also verify that the system has an adequate power supply and no thermal issues?

I want to bring to your attention that this is neither a BIOS nor a PSU issue: the same cards on the same workstations, with the same BIOS version and the same power supply, work as expected on older distros/kernels…
Anyway, it doesn’t seem to be a driver issue either! It happens even when the NVIDIA driver isn’t installed!

However, the issue requires some attention from your engineers to figure out why the RTX cards have such an issue with recent Linux distros! It’s your product in the end, and we expect you to troubleshoot the issue and tell us what to do!

FYI, the issue is not something new! The Intel idle C-states are behind it: booting the kernel with idle=nomwait (which disables the intel_idle driver and uses the ACPI driver instead) fixes the issue. However, this consumes too much energy and makes the workstation really noisy and probably hotter!
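For anyone who wants to confirm that the workaround is actually in effect after a reboot, here is a small sketch that reads the running kernel's command line and the active cpuidle driver. The helper names are made up for illustration; the two paths are the standard Linux procfs/sysfs locations, and the reads return None on systems where they are missing.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline") or ""
    print("idle= parameter:", kernel_params(cmdline).get("idle"))
    print("cpuidle driver:",
          read_text("/sys/devices/system/cpu/cpuidle/current_driver"))
```

If the description above holds, a machine booted with idle=nomwait should report the ACPI idle driver here instead of intel_idle.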

So, this is the case: if you continue to assume that something is wrong with our hardware, the issue will never get resolved! If there is something wrong with the hardware, then it’s your PCIe card design, which either lacks some sort of power regulator or requires a firmware update that allows the card to deal with the power reduction that happens when the kernel puts the processor into idle via the C-state levels.
