RTX 4090 - Xid 79 fell off the bus infrequently

lingwow · July 19, 2024, 6:39am

Over the last month or two I’ve been getting infrequent Xid 79 errors. There is no pattern and I cannot replicate it, though it has happened over a dozen times this quarter.

What I’ve done:

Replaced the motherboard with a brand new one.
Replaced the PSUs and ensured they are sufficiently powerful (2000 W). The PC is also on its own circuit breaker; I’m not sure if that matters, but this room was wired by an electrician to handle the power draw.
Swapped the slot the GPU is in.
Set up various kernel command-line parameters; currently: nouveau.modeset=0 nvidia-drm.modeset=0 pcie_aspm=off.
Turned off ACPI in the BIOS.
Updated the drivers to the latest version directly from the .run file, disabling Ubuntu’s own package driver manager.
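
For anyone comparing setups, here is a minimal sketch for confirming that the running kernel actually booted with the parameters listed above. It assumes a Linux system where `/proc/cmdline` is readable; the `missing_params` helper and the `EXPECTED` list are illustrative names, not part of any NVIDIA tooling.

```python
from pathlib import Path

# Parameters quoted in the post; adjust to match your own setup.
EXPECTED = ["nouveau.modeset=0", "nvidia-drm.modeset=0", "pcie_aspm=off"]


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    path = Path("/proc/cmdline")
    if path.exists():
        missing = missing_params(path.read_text())
        print("missing:", ", ".join(missing) if missing else "none")
```

This only confirms the parameters were passed at boot; it does not show whether the driver or the platform honored them.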

Stress tests:

stress-ng (for the system) runs perfectly fine, and I can run it for hours.
memtest86 on the memory to ensure system memory is okay.
gpu-burn by wilicc.
Pytorch-benchmark-volta to benchmark the GPU using ML workloads.
Tests are run for at least 3 hours, with thermals monitored via nvidia-smi and exported to a Grafana dashboard.
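
The monitoring step can be sketched roughly as follows. This is not the exporter used for the dashboard mentioned above, just one assumed way to sample temperature and power draw with `nvidia-smi --query-gpu`. The function names and the 80 °C default limit are hypothetical choices for illustration.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu properties.
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def parse_samples(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, temp_c, power_w = (field.strip() for field in row)
        try:
            samples.append(
                {"gpu": int(index), "temp_c": float(temp_c), "power_w": float(power_w)}
            )
        except ValueError:
            continue  # e.g. a field reported as "[N/A]"
    return samples


def read_samples():
    """Run nvidia-smi once and return the parsed per-GPU samples."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_samples(result.stdout)


def hotter_than(samples, limit_c=80.0):
    """Return the samples whose temperature exceeds limit_c (hypothetical limit)."""
    return [s for s in samples if s["temp_c"] > limit_c]


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        try:
            for sample in hotter_than(read_samples()):
                print("hot GPU:", sample)
        except subprocess.CalledProcessError as err:
            print("nvidia-smi failed:", err)
```

Logging power draw next to temperature may be worth it for intermittent faults, since a thermally healthy card can still drop off the bus for power-delivery reasons.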

I cannot replicate the issue with stress tests or with my usual workloads, which makes it impossible to trial-and-error fixes and confirm that they work. I can run stress tests for 6+ hours without a crash and work for a week or two with no problems, then suddenly the GPU fails with Xid 79 while I’m out, with minimal information.

In terms of heat, as far as I can see via nvidia-smi during long-running stress tests, the GPU doesn’t get hotter than 70-80 °C, and I’ve tried my best to ensure it gets adequate cooling.

Since the GPU crashed as I was writing this post, I’ve run nvidia-bug-report.sh. I wasn’t able to upload the report to NVIDIA’s ticket system or send it via email, as it was over 250 MB in size. I can trim it and offer it to anyone who needs it.

I’m willing to try anything, including experimental driver settings, to get to the bottom of this. I need my GPU, and in case it’s not a hardware issue I’d like to fix it. I can’t afford to be out of work for a month if I need to RMA it, and furthermore they might take it and not be able to confirm the issue on their end either.

System specifications:

CPU: Ryzen Threadripper 5955WX
Motherboard: ASRock WRX80D8-2T
RAM: Kingston KSM32RD4/32HDR, ECC 32 GB modules.
NVMe: Crucial 4TB P3 Plus.
GPUs: Zotac RTX 4090 Trinity OC x6 (not using the OC functionality; left at stock).
Operating System: Ubuntu 22.04 LTS (Kernel: 5.15.0-113-generic)
NVIDIA driver: 550.78 (CUDA: 12.4)
No riser cards, retimers, etc. The GPU is plugged directly into the motherboard, if this matters.

Thanks!

lingwow · August 18, 2024, 10:52pm

This appears to still be a problem:

[Sat Aug 17 12:05:06 2024] docker0: port 3(veth2cb8199) entered blocking state
[Sat Aug 17 12:05:06 2024] docker0: port 3(veth2cb8199) entered forwarding state
[Sun Aug 18 16:33:59 2024] NVRM: GPU at PCI:0000:61:00: GPU-775430f1-f075-c58d-f562-c11eddfc9b2a
[Sun Aug 18 16:33:59 2024] NVRM: Xid (PCI:0000:61:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Sun Aug 18 16:33:59 2024] NVRM: GPU 0000:61:00.0: GPU has fallen off the bus.
[Sun Aug 18 16:33:59 2024] NVRM: A GPU crash dump has been created. If possible, please run
                           NVRM: nvidia-bug-report.sh as root to collect this data before
                           NVRM: the NVIDIA kernel module is unloaded.
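
For tracking how often this happens and on which slot, a small sketch that pulls Xid events out of kernel log text like the above. The regular expression is an assumption based only on the line format shown in this post, and `find_xid_events` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:61:00): 79, pid='<unknown>', ...
# The pattern is inferred from the log excerpt in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, full_line) for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("xid")), line.strip()))
    return events


if __name__ == "__main__":
    # Sample line copied from the post; in practice, feed in saved `dmesg` output.
    sample = (
        "[Sun Aug 18 16:33:59 2024] NVRM: Xid (PCI:0000:61:00): 79, "
        "pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
    )
    for pci, xid, _line in find_xid_events(sample):
        print(f"Xid {xid} on PCI {pci}")
```

If the same PCI address shows up every time, that points at one card, slot, or cable rather than a system-wide problem.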

Completely out of ideas on this one.

lingwow · September 18, 2024, 5:20pm

Good news, I figured out the bug and solved it.

sahilmalhotra17 · September 23, 2024, 4:03am

Hi. I’m in the same boat.
Could you please share what the bug/solution was and how you figured it out?

lingwow · September 26, 2024, 8:13pm

Check your power cables. Mine had a faulty sense cable connection, so a small bump or shake would cause the whole GPU to crash.

I replaced the cable and haven’t had a problem since. You might have luck simply re-seating yours.

