Hi, I think I too am suffering from this issue.
The symptoms are exactly as OP describes. Things get slower and slower until they’re so slow that the system is effectively graphically locked up. Videos and games both seem to be related. I haven’t nailed down a particular way to make the crash occur. In this case, I had middle-clicked in Firefox to activate drag scrolling, and I was on Twitter. Perhaps an auto-playing video got pulled in by the infinite scroll, or something else happened with Firefox’s WebRender?
The bug largely feels random. Sometimes I’ll go a week without seeing it; sometimes I’ll see it multiple times in the same hour. In the latter case, it’s almost always (maybe always) preceded by a video or a game, though not any particular video or game consistently.
Most recently, I tried switching to a tty and was able to get there (the change between X and the console tty is very slow, but once on the tty things run at full speed, including audio). Unfortunately, I wasn’t aware of the nvidia bug report script, so I’ll try to get that next time. I did, however, dump dmesg from the tty. This is the relevant portion:
[Apr23 12:44] NVRM: GPU at PCI:0000:0a:00: GPU-2dd471df-2353-145b-1ac7-ddae77f72306
[ +0.000004] NVRM: GPU Board Serial Number:
[ +0.000004] NVRM: Xid (PCI:0000:0a:00): 61, pid=1221, 0cec(3098) 00000000 00000000
[Apr23 12:45] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +11.998307] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:46] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +8.499499] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +5.011144] fbcon: Taking over console
[ +0.000086] Console: switching to colour frame buffer device 128x48
[Apr23 12:49] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:50] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +12.038590] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +8.498130] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +36.099874] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:51] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +36.088768] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +8.499709] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:55] spotify[13605]: segfault at 4 ip 00007f0e6e66ef07 sp 00007fffe94dd328 error 6 in libnvidia-glcore.so.440.82[7f0e6d601000+1814000]
[ +0.000007] Code: 04 01 00 00 44 89 ab 08 01 00 00 44 89 b3 0c 01 00 00 e9 5b ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 8b 44 24 08 83 c2 1a <c7> 46 04 e4 08 04 20 c1 e2 12 89 4e 08 44 89 46 0c 81 ca 00 0e 00
[ +8.069868] spotify[15835]: segfault at 4 ip 00007fad2d0d4f07 sp 00007fff90e12e48 error 6 in libnvidia-glcore.so.440.82[7fad2c067000+1814000]
[ +0.000007] Code: 04 01 00 00 44 89 ab 08 01 00 00 44 89 b3 0c 01 00 00 e9 5b ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 8b 44 24 08 83 c2 1a <c7> 46 04 e4 08 04 20 c1 e2 12 89 4e 08 44 89 46 0c 81 ca 00 0e 00
I’m going to break this down a bit. Based on my Firefox history, my last Google search (I think right before I went to Twitter) was at 12:47. So perhaps whatever this NVRM message at 12:44 is built up to the major issue over the course of those 3 minutes, or maybe it’s unrelated.
WRT the tty switch, this is what a normal tty switch looks like for me in dmesg:
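In case it helps anyone compare logs, here is a minimal Python sketch for pulling the Xid events out of a dmesg dump. The regular expression is only an assumption based on the single NVRM line in my log (other driver versions may format the line differently), and `find_xid_events` is just a name I made up for illustration.

```python
import re

# Assumed format, based on one line from the log above:
#   NVRM: Xid (PCI:0000:0a:00): 61, pid=1221, 0cec(3098) 00000000 00000000
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


def find_xid_events(dmesg_text):
    """Return a list of (bus, xid, pid) tuples found in dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m.group("bus"), int(m.group("xid")), int(m.group("pid"))))
    return events


sample = (
    "[  +0.000004] NVRM: Xid (PCI:0000:0a:00): 61, pid=1221, "
    "0cec(3098) 00000000 00000000"
)
print(find_xid_events(sample))  # [('PCI:0000:0a:00', 61, 1221)]
```

Running this over a full dump would at least make it easy to see whether an Xid always shows up shortly before the slowdown.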
[Apr23 13:16] fbcon: Taking over console
[ +0.000134] Console: switching to colour frame buffer device 128x48
This “Lost display notification” stuff seems to be abnormal, and thus related – I believe both to the crash and the tty switch.
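A rough way to tell a normal tty switch from an affected one is to count these warnings in the dmesg excerpt around the switch. The sketch below is Python; the pattern is an assumption taken from the wording in my own log, and `count_lost_notifications` is a hypothetical helper name.

```python
import re

# Assumed wording of the nvidia-modeset warning, taken from the log above.
LOST_RE = re.compile(r"nvidia-modeset: WARNING: GPU:\d+: Lost display notification")


def count_lost_notifications(dmesg_text):
    """Count 'Lost display notification' warnings in a dmesg excerpt."""
    return sum(1 for line in dmesg_text.splitlines() if LOST_RE.search(line))


normal_switch = """\
[Apr23 13:16] fbcon: Taking over console
[  +0.000134] Console: switching to colour frame buffer device 128x48
"""

slow_switch = """\
[  +8.499499] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  +5.011144] fbcon: Taking over console
[  +0.000086] Console: switching to colour frame buffer device 128x48
"""

print(count_lost_notifications(normal_switch))  # 0
print(count_lost_notifications(slow_switch))    # 1
```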
I also noticed an interesting phenomenon. If I pulled up top, typically a single program would have abnormally high CPU usage in comparison to its typical behavior. If I left top, killed that program, and came back, another GPU program would take its place. If I had to guess, programs are getting stuck in a loop trying to render, either outright failing (though not crashing), or just moving extremely slowly.
Spotify is one of the programs that went to the top, and, interestingly, in the dmesg log here you can see it died of a segfault in libnvidia-glcore.so.440.82. Looking at my fish shell history, I pulled the exact timestamp at which I killed spotify:
# Thu 23 Apr 2020 12:55:44 PM EDT
kill -9 13605
So, killing the spotify process resulted in this segfault. This again is abnormal, especially for a kill to result in a segfault inside of an nvidia library. Without the issue occurring, killing any one of spotify’s processes does not result in this error.
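To line the segfault entries up with the kill, the kernel’s segfault lines can be parsed for the pid and for the offset of the faulting instruction inside the library (instruction pointer minus the library’s load address). This is a Python sketch; the regular expression is an assumption based on the two lines in my log, and `parse_segfaults` is a made-up helper name.

```python
import re

# Assumed format, based on the two segfault lines in the log above:
#   spotify[13605]: segfault at 4 ip ... sp ... error 6 in lib.so[base+size]
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfaults(dmesg_text):
    """Return one dict per segfault line: comm, pid, lib, offset into lib."""
    results = []
    for line in dmesg_text.splitlines():
        m = SEGFAULT_RE.search(line)
        if m:
            ip = int(m.group("ip"), 16)
            base = int(m.group("base"), 16)
            results.append({
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
                "lib": m.group("lib"),
                "offset": ip - base,  # faulting instruction relative to load address
            })
    return results


log = """\
[Apr23 12:55] spotify[13605]: segfault at 4 ip 00007f0e6e66ef07 sp 00007fffe94dd328 error 6 in libnvidia-glcore.so.440.82[7f0e6d601000+1814000]
[  +8.069868] spotify[15835]: segfault at 4 ip 00007fad2d0d4f07 sp 00007fff90e12e48 error 6 in libnvidia-glcore.so.440.82[7fad2c067000+1814000]
"""

killed_pid = 13605  # from the fish history entry above
for fault in parse_segfaults(log):
    marker = "killed by me" if fault["pid"] == killed_pid else "other process"
    print(fault["comm"], fault["pid"], fault["lib"], hex(fault["offset"]), marker)
```

On these two lines, both faults come out at the same offset (0x106df07) into libnvidia-glcore.so.440.82, which is consistent with the identical Code bytes in the log. Only the first pid matches the one I killed.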
Hardware-wise, this is a Ryzen 3950X system with an RTX 2080 card, running KDE Neon (Ubuntu 18.04 LTS provides the base packages and Neon only provides Qt and KDE packages, so at a core system level it is 18.04 LTS), using kernel 5.3.0-46-generic with nvidia 440.82.
Hopefully this is at least somewhat useful.