Screen frozen from GPU crash after ~10 hours on Ubuntu 20.04 Dell XPS 8940 GeForce RTX 3070 latest driver 470.63.01

After around 10 hours of uptime I always eventually get a complete system lockup. I can’t reproduce it reliably; so far it has happened during UI animations (switching virtual desktops, a right-click popup, an IntelliSense popup), or I come back to my desk and find it has happened without any recent input. The screen is completely frozen, and no keyboard shortcut or unplugging and replugging the monitor fixes it. I can hear the system fans spinning up and the desktop gets hotter while it’s stuck in this state. My only option is a hard reboot. Checking the syslog after the reboot, I see:

Sep 9 13:51:40 hostname kernel: [24368.754920] NVRM: GPU at PCI:0000:02:00: GPU-cb2bfd72-939a-a5b9-fd12-6e43a65a8972
Sep 9 13:51:40 hostname kernel: [24368.754924] NVRM: Xid (PCI:0000:02:00): 79, pid=22320, GPU has fallen off the bus.
Sep 9 13:51:40 hostname kernel: [24368.754927] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: A GPU crash dump has been created. If possible, please run
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: the NVIDIA kernel module is unloaded.
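
To check whether Xid 79 is the only error being logged, the Xid lines can be tallied per PCI device and code. This is only a rough sketch: it assumes the kernel messages land in /var/log/syslog (as in the excerpt above), and xid_codes is a made-up helper name, not an NVIDIA tool.

```shell
# Hypothetical helper: count NVIDIA Xid events per PCI device and Xid code.
# $1: path to a kernel log file (assumed to be /var/log/syslog here)
xid_codes() {
    # -o prints only the matched part, e.g. "Xid (PCI:0000:02:00): 79"
    grep -oE 'Xid \([^)]*\): [0-9]+' "$1" | sort | uniq -c
}

# Example call (may need sudo, depending on log file permissions):
# xid_codes /var/log/syslog
```

Each output line is a count followed by the device and Xid code, so a second code showing up next to 79 would be worth mentioning in the report.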

I’m attaching the bug report as instructed (I had to scrub some work hostnames and usernames):
nvidia-bug-report.zip (372.4 KB)

I also found another thread, "PCIe Bus Error: severity=Corrected" on Jetson Nano, suggesting that the problem could be caused by ASPM, so I’ve set my kernel to boot with pcie_aspm=off, but the issue persists.

Update: If I put my computer to sleep at the end of the day, the crash always happens the next day. I’ve been shutting down at the end of the day instead, but sometimes I can’t even make it through a day without a crash.

I haven’t experienced the crash since setting intel_idle.max_cstate=1 as a workaround.
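
For anyone trying the same workarounds, it may be worth confirming that the parameters actually reached the running kernel. A minimal sketch that checks the kernel command line; has_kernel_param is a made-up helper name, and the second argument exists only so the check can be pointed at a file other than /proc/cmdline:

```shell
# Hypothetical helper: succeed if a parameter is on the kernel command line.
# $1: parameter, e.g. pcie_aspm=off
# $2: command-line file to inspect (defaults to /proc/cmdline)
has_kernel_param() {
    # Put each parameter on its own line, then look for an exact whole-line match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

for p in pcie_aspm=off intel_idle.max_cstate=1; do
    if has_kernel_param "$p"; then
        echo "$p: active"
    else
        echo "$p: missing"
    fi
done
```

If a parameter shows up as missing, the usual route on Ubuntu is to add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run sudo update-grub, and reboot. That is an assumption about how the parameters were set here, since the post does not say.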

It’s back and I can’t figure out what’s causing it.

Since this is a desktop system, as a first measure please check for a BIOS update. Furthermore, you can try reseating the graphics card in its PCIe slot (i.e. pull it out and seat it properly again).
If that doesn’t help, please create a kernel log: right after the crash happens, reboot, then run
sudo journalctl -b -1 | grep kernel > kernel.txt
and attach that.
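
Since the resulting file can get large, it may help to look first at what the kernel logged just before the GPU dropped out. A rough sketch, assuming GNU grep and a kernel.txt produced as described above; lines_before_xid is a made-up helper name:

```shell
# Hypothetical helper: print the lines leading up to the first Xid event.
# $1: kernel log file, e.g. kernel.txt
# $2: number of context lines to show before the event (default 20)
lines_before_xid() {
    # -m1 stops at the first match; -B prints that many lines before it.
    grep -m1 -B "${2:-20}" 'NVRM: Xid' "$1"
}

# Example call:
# lines_before_xid kernel.txt 50
```

PCIe or power-management messages in that window would be the interesting part. If nothing but the Xid line itself appears, that by itself is useful information.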

Thank you for the response. I did both of those things but already ran into the same crash again. :( Attached kernel.txt.

kernel.txt (1.5 MB)

This crash is still occurring even with the latest NVIDIA drivers. I think I’ll reach out to Dell support.

I guess that’s probably best. There is nothing unusual in the logs, and an Xid 79 shouldn’t happen in a pre-built system.