RTX3080 throwing XID 79, XID 13, and XID 48 under various circumstances (games and ML pipelines)

Asking for a friend: I have remote access to the computer, and someone else can physically access it.

Driver Version: 545.29.06
CUDA Version: 12.3
Kernel: 6.6.8-200.fc39.x86_64
OS: ublue-os/bazzite:latest
EVGA GeForce RTX 3080 12GB

My friend was trying to get Red Dead Redemption 2 working the other day but was experiencing crashes while running the game through Steam. The game would either get past the first cutscene and allow a few minutes of gameplay before crashing with XID 79, or crash before the cutscene ended, with the GPU failing the same way.

After some debugging and stress testing, the GPU also crashes with XID 79 when running gpu-burn (wilicc/gpu-burn on GitHub, a multi-GPU CUDA stress test). The program loads and then crashes immediately, before the tests begin. Only a few days ago the stress test was working.

Colony Survival crashes with XID 79 when the Windows key is pressed, which is bound to open the KDE start menu and effectively tabs out of the game.

I’m attaching debug logs captured from the Red Dead and gpu-burn crashes.

nvidia-bug-report-3.log.gz (188.4 KB)
nvidia-bug-report.log.gz (473.1 KB)

I also tried using the GPU for an ML pipeline that uses PyTorch and pyannote. Upon instantiating the pyannote pipeline, the GPU crashes with XID 13 and XID 48.
Here is the bug report:
nvidia-bug-report-pytorch.log.gz (88.3 KB)
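
For anyone comparing crashes like these, a small helper for pulling Xid events out of a kernel log (for example, saved `dmesg` output) can make it easier to see which codes occur and in what order. This is only a sketch: the function name `find_xids` is made up for illustration, the regular expression assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line layout, and the descriptions are the short names from NVIDIA's public Xid list for the three codes mentioned in this thread.

```python
import re

# Short names from NVIDIA's public Xid list; only the three codes
# discussed in this thread are included here.
XID_NAMES = {
    13: "Graphics Engine Exception",
    48: "Double Bit ECC Error",
    79: "GPU has fallen off the bus",
}

# Assumed line layout, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<bus>[^)]+)\): (?P<xid>\d+),")


def find_xids(log_text):
    """Return (bus_id, xid, description) for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            xid = int(match.group("xid"))
            # Codes outside the small table above are reported as "unknown".
            events.append((match.group("bus"), xid, XID_NAMES.get(xid, "unknown")))
    return events
```

Calling `find_xids()` on the text of a saved kernel log returns the events in the order they were logged, so a run of XID 79 entries can be told apart from XID 13 followed by XID 48.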

I’d say the PSU is on its way out. You can try limiting clocks by running nvidia-smi -lgc 300,1200 and then starting the game. If it runs stable, the PSU is almost certainly failing or too small.

So I’m afraid it is a power issue. With the clocks limited as you suggested, RDR2 runs stable, and the stress test is able to complete as well. I slowly increased the clock limit and was able to trigger a crash once I hit 1400 MHz.
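
The stepping procedure can be written down roughly as follows. This is a sketch, not the exact sequence that was run: the helper name `clock_lock_commands` and the 50 MHz step are assumptions for illustration, and the code only builds the `nvidia-smi -lgc MIN,MAX` command lines (plus `nvidia-smi -rgc` to reset the clocks afterwards) without executing anything.

```python
def clock_lock_commands(min_mhz, start_max_mhz, stop_max_mhz, step_mhz):
    """Build one 'nvidia-smi -lgc MIN,MAX' command per step, lowest cap first."""
    if step_mhz <= 0:
        raise ValueError("step_mhz must be positive")
    commands = []
    # Raise the maximum clock cap one step at a time, including the stop value
    # when it falls on a step boundary.
    for max_mhz in range(start_max_mhz, stop_max_mhz + 1, step_mhz):
        commands.append(["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"])
    return commands


# Restores the default (unlocked) GPU clocks once testing is finished.
RESET_COMMAND = ["nvidia-smi", "-rgc"]

# Example: start at the suggested 300,1200 limit and go up to 1400 MHz.
# The 50 MHz step size is a hypothetical choice.
for command in clock_lock_commands(300, 1200, 1400, 50):
    print(" ".join(command))
print(" ".join(RESET_COMMAND))
```

Each printed command would be run with root privileges, followed by the game or the stress test, before moving on to the next step. The last cap that runs cleanly gives a rough idea of how much load the system tolerates.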

The PSU is an EVGA SuperNOVA 1000 G5, purchased about 9 months ago, so it should be more than capable, but perhaps it’s kaput. Now I’m wondering if it’s the apartment wiring causing the issue, since other outlets in the apartment do not work…

Thank you for the help!

Good question, if other power outlets don’t work at all. Unstable input power to the PSU might well be an issue, but I have no experience with that. In my location, if the PSU draws more power than the circuit can supply, the fuse trips.

Previously I wrote: Unlikely. Going by the name, I suppose it’s a 1000W model, so it should be able to serve the 3080; it’s either broken or has other issues (like extreme ripple sensitivity). Please have it replaced if it is still under warranty, or choose a different brand if you have to buy a replacement.

There don’t seem to be any other issues around the apartment when the GPU crashes, so now I doubt it’s the home wiring at all. Good point that it would probably trip a breaker!

Will check on the warranty and have a new PSU ordered. I’ll update once the new PSU is in and report whether it stabilizes things :)

Thanks again