After around 10 hours of uptime, my system always eventually locks up completely. I can’t reproduce it reliably. So far it has happened during UI animations (switching virtual desktops, a right-click popup, an IntelliSense popup), or I come back to my desk and find it has happened without any recent input. The screen is completely frozen, and no keyboard shortcut or unplugging and replugging the monitor fixes it. I can hear the system fans spinning up, and the desktop gets hotter while it’s stuck in this state. My only option is a hard reboot. Checking the syslog after rebooting, I see:
Sep 9 13:51:40 hostname kernel: [24368.754920] NVRM: GPU at PCI:0000:02:00: GPU-cb2bfd72-939a-a5b9-fd12-6e43a65a8972
Sep 9 13:51:40 hostname kernel: [24368.754924] NVRM: Xid (PCI:0000:02:00): 79, pid=22320, GPU has fallen off the bus.
Sep 9 13:51:40 hostname kernel: [24368.754927] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: A GPU crash dump has been created. If possible, please run
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 9 13:51:40 hostname kernel: [24368.805040] NVRM: the NVIDIA kernel module is unloaded.
I’m attaching the bug report as instructed (I had to scrub some work hostnames and usernames):
nvidia-bug-report.zip (372.4 KB)
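In case it helps anyone compare against their own logs, here is a minimal Python sketch that pulls the Xid lines out of a syslog like the excerpt above. The function name `find_xid_events` and the regular expression are my own and are only matched against the line format shown in this post; other driver versions may format these messages differently.

```python
import re

# Pattern written against the one Xid line quoted in this post, e.g.
#   NVRM: Xid (PCI:0000:02:00): 79, pid=22320, GPU has fallen off the bus.
# Assumption: other driver versions may use a different layout.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(lines):
    """Return (device, xid_code, detail) tuples for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    match.group("device"),
                    int(match.group("code")),
                    match.group("detail").strip(),
                )
            )
    return events


# Two lines copied from the excerpt above; only the first is an Xid report.
sample = [
    "Sep 9 13:51:40 hostname kernel: [24368.754924] NVRM: Xid (PCI:0000:02:00): 79, pid=22320, GPU has fallen off the bus.",
    "Sep 9 13:51:40 hostname kernel: [24368.754927] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.",
]
for device, code, detail in find_xid_events(sample):
    print(device, code, detail)
```

To run it over a real log, pass an open file object instead of the sample list, for example `find_xid_events(open("/var/log/syslog", errors="replace"))`. That path is an assumption on my part and depends on the distribution.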
I also found another thread, "PCIe Bus Error: severity=Corrected" on Jetson Nano, which suggests the problem could be caused by ASPM, so I’ve set my kernel to boot with pcie_aspm=off, but the issue persists.
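To rule out the parameter simply not being applied, the running kernel's command line can be checked after boot. This is a minimal sketch, assuming a Linux system that exposes `/proc/cmdline`; the helper names `aspm_disabled_on_cmdline` and `read_cmdline` are hypothetical and only for illustration.

```python
def aspm_disabled_on_cmdline(cmdline):
    """Return True if 'pcie_aspm=off' appears as a whole kernel parameter."""
    # Split on whitespace so only an exact parameter token matches.
    return "pcie_aspm=off" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


# Example with a made-up command line (not taken from my machine):
print(aspm_disabled_on_cmdline("BOOT_IMAGE=/vmlinuz ro quiet pcie_aspm=off"))
```

On the affected machine, `aspm_disabled_on_cmdline(read_cmdline())` should return True if the boot loader actually passed the option. It only confirms the parameter was passed, not what the driver or firmware does with it.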
Update: If I put my computer to sleep at the end of the day, the crash always happens the next day. I now shut down at the end of each day instead, but sometimes I can’t even make it through a day without a crash.