Over the last month or two I’ve been getting intermittent Xid 79 errors. There is no pattern and I cannot reproduce them, though it has happened over a dozen times this quarter.
What I’ve done:
- Replaced the motherboard with a brand new one.
- Replaced the PSUs and made sure they are powerful enough (2000 W). The PC is also on its own circuit breaker; I’m not sure if that matters, but this room was wired by an electrician to handle the power draw.
- Moved the GPU to different PCIe slots.
- Set up various kernel command-line parameters, currently: (nouveau.modeset=0 nvidia-drm.modeset=0 pcie_aspm=off).
- Turned off ACPI in the BIOS.
- Updated the drivers to the latest version directly from the .run file, disabling Ubuntu’s own package driver manager.
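Since boot parameters are easy to lose after a GRUB or kernel update, here is a minimal sketch for checking that the three parameters listed above are actually active on the running kernel. It reads `/proc/cmdline`; the helper names `parse_cmdline` and `missing_params` are hypothetical, and the parser is deliberately simple (it does not handle quoted values containing spaces).

```python
from pathlib import Path

# Parameters taken from the list above.
EXPECTED = {"nouveau.modeset": "0", "nvidia-drm.modeset": "0", "pcie_aspm": "off"}


def parse_cmdline(text: str) -> dict[str, str]:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in text.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(text: str, expected: dict[str, str] = EXPECTED) -> dict[str, str]:
    """Return the expected parameters that are absent or set to another value."""
    active = parse_cmdline(text)
    return {k: v for k, v in expected.items() if active.get(k) != v}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():  # only present on Linux
        print(missing_params(proc.read_text()) or "all expected parameters are active")
```

An empty result means every expected parameter is present with the expected value.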
Stress tests:
- stress-ng (for the system), which runs perfectly fine for hours.
- memtest86 on the memory to ensure system memory is okay.
- gpu-burn by wilicc
- Pytorch-benchmark-volta to benchmark the GPU using ML workloads.
- Tests run for at least 3 hours, with thermals monitored via nvidia-smi and exported to a Grafana dashboard.
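The monitoring step can be sketched roughly as follows. This is an illustration, not the actual exporter behind the dashboard: it assumes nvidia-smi is polled with the query fields shown in `NVIDIA_SMI_CMD` and that the resulting CSV text is parsed afterwards. The function names are hypothetical and the command is only listed, not executed.

```python
import csv
import io

QUERY_FIELDS = ["timestamp", "index", "temperature.gpu", "power.draw"]

# Command that would produce the CSV parsed below (shown for reference only).
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
    "-l", "5",  # repeat every 5 seconds
]


def parse_samples(csv_text):
    """Parse nvidia-smi CSV output into dicts, skipping malformed rows."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue
        timestamp, index, temp, power = rec
        try:
            rows.append({
                "timestamp": timestamp,
                "gpu": int(index),
                "temp_c": float(temp),
                "power_w": float(power),
            })
        except ValueError:
            # Fields can be non-numeric (e.g. "[N/A]"); skip those samples.
            continue
    return rows


def hottest(rows):
    """Return the sample with the highest GPU temperature."""
    return max(rows, key=lambda r: r["temp_c"])
```

Logging power draw next to temperature keeps both available for the minutes before a crash, since the last samples before the GPU disappears are usually the interesting ones.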
I cannot reproduce the issue with stress tests or with my usual workloads, which makes it impossible to trial-and-error fixes and confirm that they work. I can run stress tests for 6+ hours without a crash and work for a week or two with no problems, then the GPU will suddenly fail with Xid 79 while I’m out, with minimal information.
As for heat, as far as I can see via nvidia-smi during long-running stress tests, the GPU doesn’t get hotter than 70-80 °C, and I have tried my best to ensure it gets adequate cooling.
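Because the failures happen unattended, one way to get a little more information is to scan the kernel log after the fact and tally Xid events per PCI address, which would show whether the same card or slot is always involved. The sketch below assumes the log text comes from something like `journalctl -k` or `dmesg`, and that Xid lines follow the usual `NVRM: Xid (PCI:...): <code>, ...` shape. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:41:00): 79, ..." and captures address and code.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, full_line) for every Xid line in the log."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events


def count_by_device(events, xid=79):
    """Count how often the given Xid code occurred per PCI address."""
    return Counter(bus for bus, code, _ in events if code == xid)
```

If every Xid 79 lands on the same PCI address even after swapping slots, that points at the slot; if it follows one physical card, that points at the card.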
Since the GPU crashed as I was writing this post, I’ve run nvidia-bug-report.sh. I wasn’t able to upload the report to NVIDIA’s ticket system or send it by email, as it was over 250 MB in size. I can trim it and offer it to anyone who needs it.
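Trimming the report could look roughly like this: a sketch that streams the gzipped log and keeps only a window of lines around each Xid message. It assumes the report is the gzipped text file that nvidia-bug-report.sh produces. The function name, the `needle` default and the window size are hypothetical choices, and a trimmed report may well drop sections that support staff want to see.

```python
import gzip
from collections import deque


def trim_report(src, dst, needle="NVRM: Xid", context=50):
    """Copy lines within `context` lines of any line containing `needle`.

    Streams the input, so the whole report is never held in memory.
    Returns the number of lines written.
    """
    before = deque(maxlen=context)  # rolling buffer of not-yet-written lines
    remaining = 0                   # lines still to copy after the last match
    kept = 0
    with gzip.open(src, "rt", encoding="utf-8", errors="replace") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if needle in line:
                # Flush the buffered lines leading up to the match, then the match.
                for prev in before:
                    fout.write(prev)
                    kept += 1
                before.clear()
                fout.write(line)
                kept += 1
                remaining = context
            elif remaining > 0:
                fout.write(line)
                kept += 1
                remaining -= 1
            else:
                before.append(line)
    return kept
```

Usage would be along the lines of `trim_report("nvidia-bug-report.log.gz", "trimmed.log.gz")`, with the output file name being arbitrary.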
I’m willing to try anything, including experimental driver settings, to get to the bottom of this. I need my GPU, and in case it’s not a hardware issue I’d like to fix it. I can’t afford to be out of work for a month if I need to RMA it, and furthermore they might take it and be unable to confirm the issue on their end.
System specifications:
- CPU: Ryzen Threadripper PRO 5955WX
- Motherboard: ASRock WRX80D8-2T
- RAM: Kingston - KSM32RD4/32HDR - ECC 32GB modules.
- NVMe: Crucial 4TB P3 Plus.
- GPUs: Zotac RTX 4090 Trinity OC x6 (not using the OC functionality; left at stock).
- Operating System: Ubuntu 22.04 LTS (Kernel: 5.15.0-113-generic)
- NVIDIA driver: 550.78 (CUDA: 12.4)
- No riser cards, retimers, etc. The GPU’s plugged directly into the motherboard, if this matters.
Thanks!