RTX 4090 - intermittent Xid 79 (GPU has fallen off the bus)

Over the last month or two I've infrequently been getting Xid 79 errors. There's no pattern and I cannot replicate it, though it has happened over a dozen times this quarter.

What I've done:

  1. Replaced the motherboard with a brand new one.
  2. Replaced the PSUs and ensured they are sufficiently powerful (2000 W). The PC also runs on its own circuit breaker; I'm not sure if that matters, but the room was wired by an electrician to handle the power draw.
  3. Swapped the GPUs between PCIe slots.
  4. Set various kernel command-line parameters, currently: nouveau.modeset=0 nvidia-drm.modeset=0 pcie_aspm=off (see the sketch after this list).
  5. Turned off ACPI in the BIOS.
  6. Updated the driver to the latest version directly from NVIDIA's .run file, disabling Ubuntu's own packaged driver.
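
For anyone wanting to reproduce step 4, a minimal sketch of one way to make those parameters persistent on Ubuntu, assuming the stock GRUB setup (the sed one-liner is just a convenience; editing the file by hand works too):

# Assumes Ubuntu's default GRUB config; back it up first.
sudo cp /etc/default/grub /etc/default/grub.bak
# Append the options to the existing default kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 nouveau.modeset=0 nvidia-drm.modeset=0 pcie_aspm=off"/' /etc/default/grub
sudo update-grub
# After rebooting, confirm the parameters actually took effect:
cat /proc/cmdline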

Stress tests:

  1. stress-ng (for the overall system); it runs perfectly fine for hours.
  2. memtest86 to confirm the system memory is okay.
  3. gpu-burn (by wilicc).
  4. pytorch-benchmark-volta to benchmark the GPU with ML workloads.
  5. Tests are run for at least 3 hours, with thermals monitored via nvidia-smi and exported to a Grafana dashboard (see the logging sketch after this list).
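
For completeness, a minimal sketch of the kind of nvidia-smi logging loop behind point 5, assuming the CSV is picked up by whatever feeds the Grafana dashboard; the output path is just an example:

# Log per-GPU thermals, power and PCIe link state every 5 seconds.
# /var/log/nvidia-smi-metrics.csv is an arbitrary example path.
nvidia-smi \
  --query-gpu=timestamp,index,name,temperature.gpu,power.draw,clocks.sm,pcie.link.gen.current,pcie.link.width.current \
  --format=csv -l 5 -f /var/log/nvidia-smi-metrics.csv

Logging the PCIe link generation/width alongside temperature may help show whether the link degrades or retrains right before an Xid 79, which would point at signalling or power rather than heat.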

The core problem is that I cannot reproduce the issue with stress tests or with my usual workloads, which makes it impossible to trial-and-error fixes and confirm that anything works. I can run stress tests for 6+ hours without a crash and work for a week or two with no problems, then suddenly the GPU fails with Xid 79 while I'm out, leaving minimal information.

In terms of heat, as far as I can see via nvidia-smi during long-running stress tests, the GPU doesn't get hotter than 70-80 °C, and I've done my best to ensure it gets adequate cooling.

The GPU crashed again as I was writing this post, so I ran nvidia-bug-report.sh. I wasn't able to upload the report to NVIDIA's tickets or send it via email as it was over 250 MB in size, but I can snip/trim it and offer it to anyone who needs it (a sketch of how I'd cut it down is below).
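
If anyone wants the report, a minimal sketch of how I'd trim it, assuming the default nvidia-bug-report.log.gz output name (the context line counts are arbitrary):

# Extract only the NVRM/Xid-related lines plus some surrounding context,
# so the excerpt fits under the attachment size limit.
zcat nvidia-bug-report.log.gz | grep -E -B 5 -A 20 'NVRM|Xid|fallen off the bus' > xid-excerpt.log
gzip xid-excerpt.log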

I'm willing to try anything, including experimental driver settings, to get to the bottom of this. I need my GPU, and in case it's not a hardware issue I'd like to fix it; I can't afford to be out of work for a month if I need to RMA it, and even more so if they take it and cannot confirm the issue on their end either.

System specifications:

  • CPU: Ryzen Threadripper 5955wx
  • Motherboard: Asrock WRX80D8-2T
  • RAM: Kingston KSM32RD4/32HDR, ECC 32 GB modules.
  • NVME: Crucial 4TB P3 Plus.
  • GPUs: 6× Zotac RTX 4090 Trinity OC (not using the OC functionality, left at stock).
  • Operating System: Ubuntu 22.04 LTS (Kernel: 5.15.0-113-generic)
  • Nvidia driver: 550.78 (Cuda: 12.4)
  • No riser cards, retimers, etc.; the GPUs are plugged directly into the motherboard, if that matters.

Thanks!

This appears to still be a problem:

[Sat Aug 17 12:05:06 2024] docker0: port 3(veth2cb8199) entered blocking state
[Sat Aug 17 12:05:06 2024] docker0: port 3(veth2cb8199) entered forwarding state
[Sun Aug 18 16:33:59 2024] NVRM: GPU at PCI:0000:61:00: GPU-775430f1-f075-c58d-f562-c11eddfc9b2a
[Sun Aug 18 16:33:59 2024] NVRM: Xid (PCI:0000:61:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Sun Aug 18 16:33:59 2024] NVRM: GPU 0000:61:00.0: GPU has fallen off the bus.
[Sun Aug 18 16:33:59 2024] NVRM: A GPU crash dump has been created. If possible, please run
                           NVRM: nvidia-bug-report.sh as root to collect this data before
                           NVRM: the NVIDIA kernel module is unloaded.

Completely out of ideas on this one.

Good news: I figured out the bug and solved it.

Hi, I'm in the same boat.
Can you please share what the bug/solution was and how you figured it out?

Check your power cables; mine had a faulty sense cable connection, so a small bump or shake would cause the whole GPU to crash.

I replaced the cable and haven't had a problem since. You might have luck simply re-seating yours?
