Regular Xorg freezes on KDE Manjaro

Hi,
I’m a Manjaro KDE user on a new PC with a GTX 1660 Super, running the latest NVIDIA driver.
It works fine, but occasionally, and very unexpectedly, the graphical interface freezes. I cannot figure out the cause: it happens in very different situations, whether the PC is idle or I am working on something.
This is not a freeze of the system, just of the display. I can still hear the sound of the film being played for instance, but the image does not refresh. Once I was doing a video transcoding and even though the display was frozen, I left it running and it finished successfully (I had to reboot to check it).
When it happens, I can get some reaction at first: move the mouse a bit, try to close a window. The display refreshes maybe once every 5, 10, or 20 seconds. But then it gets really stuck.
This gave me the time to check the system monitor, and every time it’s the same: two processes, Xorg and irq/106-nvidia, are each running full throttle on their own core.
For the rest, the performance report is fine: all other cores are free and there is plenty of free memory.
Only a reboot solves it, until it happens again.

Any help in investigating this would be appreciated!
For the record, a strange thing I can see in the log is the following type of error:

[   280.154] (EE) client bug: timer event2 debounce: scheduled expiry is in the past (-2ms), your system is too slow

Not sure this has anything to do with what I’m seeing.
Thank you,

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Right, I had tried actually, but it wasn’t accepted. Here it is, I removed the .gz extension
nvidia-bug-report.log (217.2 KB)

Unfortunately, no errors logged. Did you run it right after the crash? If not, please wait until it crashes again and run it right after.

Here is a log not after a crash but after other issues that may be related?
X11 often fails to start after the login screen. It seems like it’s loading, but really slowly, and then I end up on a black screen with just the mouse pointer. I need to reboot, because going back to the login screen and logging in again gives the same result.
Also, after such things happen, my desktop icons are all over the place, even outside of the monitor. So I can tell there is a failure in recognizing the resolution or the display in general (but then why would I get the mouse pointer?)
nvidia-bug-report2.log (572.2 KB)

Nothing noteworthy in the logs. Did you already check if just the display connection is flawed, by using a different cable/connector/monitor? Do you use any kind of adapter/converter on it?

Hi, I think I am having a similar issue using Ubuntu 19.10 with kernel 5.3.0-46-generic on a 1080 Ti with nvidia-driver-435. Randomly over the last couple of weeks, my desktop display freezes and I’m unable to use the mouse or keyboard. SSH works fine, and I am able to remotely run the NVIDIA reporting tool as well as power off the host. It sometimes happens just after I enter my drive encryption password at boot, sometimes during a video game, but it seems to happen more often when using a browser. In all cases my syslog shows the message:

Apr 21 13:00:43 hostpc kernel: [ 3584.430531] NVRM: GPU at PCI:0000:01:00: GPU-bd7638f6-40d1-2ddd-0a8f-5ffbddd256b6
Apr 21 13:00:43 hostpc kernel: [ 3584.430561] NVRM: GPU Board Serial Number:
Apr 21 13:00:43 hostpc kernel: [ 3584.430566] NVRM: Xid (PCI:0000:01:00): 79, pid=1499, GPU has fallen off the bus.
Apr 21 13:00:43 hostpc kernel: [ 3584.430568] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.

Perhaps a seat-belt might help? :D

Attached are some logs. Two are from days ago and the other 2 are from the crashes that happened when trying to make this post today.

I’ll upgrade the kernel tomorrow, unless there is something else I can run to help diagnose the root cause.

This case seems to have the same problem: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus - #3 by gsakhel
In this case, upgrading to kernel 5.6 seemed to solve the problem: Display freeze at monitor turn on for a few seconds with NVIDIA 440.59 - #7 by mozo

nvidia-bug-report_days_ago.log (2.2 MB) nvidia-bug-report_days_ago_1.log (2.0 MB) nvidia-bug-report_today_0.log (2.0 MB) nvidia-bug-report_today_1.log (2.1 MB)

I believe I’m seeing a very similar problem to the others, with Fedora 31, NVIDIA driver 440.82, and Xorg 1.20.6. My system has a Titan RTX. The display randomly freezes about once a week and I have to SSH in from another system to reboot. The audio from WebEx continues to work and I can move the mouse, but nothing else responds. Here are what appear to be the relevant lines from the kernel log (from journalctl).

NVRM: GPU at PCI:0000:08:00: GPU-5c7bd6dd-22ca-43c3-871a-ec88ae1cf126
NVRM: GPU Board Serial Number: 0324918077010
NVRM: Xid (PCI:0000:08:00): 61, pid=1586, 0cec(3098) 00000000 00000000
NVRM: Xid (PCI:0000:08:00): 8, pid=1586, Channel 00000018
/usr/libexec/gdm-x-session[1584]: (WW) NVIDIA: Wait for channel idle timed out.
/usr/libexec/gdm-x-session[1584]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00008670, 0x00008678)
GpuWatchdog[2927]: segfault at 0 ip 000055a0e38235b0 sp 00007f0f321a54e0 error 6 in chrome[55a0df4b8000+7347000]
Code: 3d 30 76 fb fa be 01 00 00 00 ba 07 00 00 00 e8 16 06 72 fe 48 8d 3d 18 b4 fc fa be 01 00 00 00 ba 03 00 00 00 e8 00 06 72 fe <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 56 9e 96 03 01 80 7d 87 00
/usr/libexec/gdm-x-session[1584]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00008670, 0x00008678)
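For anyone comparing logs across machines, here is a minimal Python sketch that pulls the `NVRM: Xid` lines out of a syslog or journalctl dump and tallies them by Xid number. The regular expression is modelled only on the lines quoted in this thread; the function names `parse_xid_events` and `count_by_xid` are made up for illustration, and other driver versions may print the `pid=` field differently, in which case the pattern would need adjusting.

```python
import re
from collections import Counter

# Pattern based on the lines quoted in this thread, e.g.
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1499, GPU has fallen off the bus.
# Assumption: pid is always a plain number, as in the quoted logs.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<detail>.*)"
)


def parse_xid_events(log_text):
    """Return one dict per Xid line found in log_text, in log order."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "bus": match.group("bus"),
                "xid": int(match.group("xid")),
                "pid": int(match.group("pid")),
                "detail": match.group("detail").strip(),
            })
    return events


def count_by_xid(events):
    """Count how many times each Xid number occurs."""
    return Counter(event["xid"] for event in events)
```

Feeding it the text of a saved kernel log gives a quick view of which Xid numbers occur and how often, which makes it easier to tell whether two reports show the same failure (Xid 79 in one post, Xid 61 and 8 in another).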

To answer the question, nothing special in terms of connection, regular DP-DP cable, I could try another but it’s working fine all the time when the display is started correctly.

Hi, I think I too am suffering from this issue.

The symptoms are exactly as OP describes. Things start moving slower and slower, until they’re so slow that the system is effectively graphically locked up. Videos and games both seem to be related. I haven’t nailed down a particular way to make the crash occur. In this case, I had middle-clicked in Firefox to activate the dragged scroll, and I was on Twitter. Perhaps an auto-playing video got pulled in by the infinite scroll, or something else happened with Firefox’s WebRender?

The bug largely feels random, sometimes I’ll go a week without seeing it, sometimes I’ll see it multiple times in the same hour – in the latter case, it’s almost always (maybe always) preceded by a video or a game, though not any particular video or game consistently.

The most recent time, I was able to get to a tty (the change between X and the console tty is very slow, but once on the tty things are full speed, including audio). Unfortunately, I wasn’t aware of the nvidia bug report script, so I’ll try to get that next time. I did, however, dump dmesg from the tty. This is the relevant portion:

[Apr23 12:44] NVRM: GPU at PCI:0000:0a:00: GPU-2dd471df-2353-145b-1ac7-ddae77f72306
[  +0.000004] NVRM: GPU Board Serial Number: 
[  +0.000004] NVRM: Xid (PCI:0000:0a:00): 61, pid=1221, 0cec(3098) 00000000 00000000
[Apr23 12:45] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +11.998307] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:46] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  +8.499499] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  +5.011144] fbcon: Taking over console
[  +0.000086] Console: switching to colour frame buffer device 128x48
[Apr23 12:49] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:50] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +12.038590] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  +8.498130] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +36.099874] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:51] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ +36.088768] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  +8.499709] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[Apr23 12:55] spotify[13605]: segfault at 4 ip 00007f0e6e66ef07 sp 00007fffe94dd328 error 6 in libnvidia-glcore.so.440.82[7f0e6d601000+1814000]
[  +0.000007] Code: 04 01 00 00 44 89 ab 08 01 00 00 44 89 b3 0c 01 00 00 e9 5b ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 8b 44 24 08 83 c2 1a <c7> 46 04 e4 08 04 20 c1 e2 12 89 4e 08 44 89 46 0c 81 ca 00 0e 00
[  +8.069868] spotify[15835]: segfault at 4 ip 00007fad2d0d4f07 sp 00007fff90e12e48 error 6 in libnvidia-glcore.so.440.82[7fad2c067000+1814000]
[  +0.000007] Code: 04 01 00 00 44 89 ab 08 01 00 00 44 89 b3 0c 01 00 00 e9 5b ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 8b 44 24 08 83 c2 1a <c7> 46 04 e4 08 04 20 c1 e2 12 89 4e 08 44 89 46 0c 81 ca 00 0e 00

I’m going to break this down a bit. Based on my firefox history, my last google search, (I think right before I went to twitter) was 12:47. So, perhaps whatever this NVRM message at 12:44 is, built up to the major issue over the course of those 3 minutes, or maybe it’s unrelated.

WRT the tty switch, this is what a normal tty switch looks like for me in dmesg:

[Apr23 13:16] fbcon: Taking over console
[  +0.000134] Console: switching to colour frame buffer device 128x48

This “Lost display notification” stuff seems to be abnormal, and thus related – I believe both to the crash and the tty switch.

I also noticed an interesting phenomenon. If I pulled up top, typically a single program would have abnormally high CPU usage in comparison to its typical behavior. If I left top, killed that program, and came back, another GPU program would take its place. If I had to guess, programs are getting stuck in a loop trying to render, either outright failing (though not crashing) or just moving extremely slowly.

Spotify is one of the programs that went to the top, and interestingly, in the dmesg log here you can see it died of a segfault in libnvidia-glcore.so.440.82. Looking at my fish shell history, I pulled the exact timestamp at which I killed Spotify:

# Thu 23 Apr 2020 12:55:44 PM EDT
kill -9 13605

So, killing the Spotify process resulted in this segfault. This again is abnormal, especially a kill resulting in a segfault inside an NVIDIA library. When the issue is not occurring, killing any one of Spotify’s processes does not result in this error.

Hardware wise, this is a Ryzen 3950X system with a 2080 RTX card, running on KDE Neon (Ubuntu 18.04 LTS provides the base packages, Neon only provides Qt and KDE packages, so at a core system level, 18.04 LTS) using kernel 5.3.0-46-generic with nvidia 440.82.

Hopefully this is at least somewhat useful.

I got a nvidia-bug-report.log.gz this time, and I’ve emailed it to the provided email. Interestingly this time, I was unable to gracefully shutdown my system, otherwise, very similar results, including the Lost display notification.

So I was about to write I had not had the problem in a week and thought maybe a kernel or driver update fixed it, but then it just happened again!
The freeze left me unable to do anything and I had to hard reboot.
Here is the log made right after the reboot, hope there is something to be seen in there…
Thanks,

nvidia-bug-report3.log (543.0 KB)

I have filed a bug 200614112 internally for tracking purpose.
I will attempt to reproduce it and may reach out to you again if more information is required.

Thank you,
Was there anything worth noting in the log then? Maybe something I can try in the meantime?

The logs don’t have enough detailed information to root-cause the issue.
It would be great to have concrete and reliable repro steps so that I can try the same.

Hi, unfortunately, I cannot reproduce it myself clearly. I worked for a week on the computer, all day, and had no problems, but then sometimes it happens twice in a few hours.
The only regularity I can mention (although it’s not always the case) is that it often happens when playing a video. It can be in a web browser or a video player like VLC; it does not matter. Maybe it was in fullscreen most of the time, I’m not sure.
But I often watch videos (fullscreen or not) and do NOT get the freeze, so I’m not sure why it would sometimes cause it.

Adding to the previous comment, I just had a freeze while not watching a video and not doing much.
Log attached, this time the log was taken just as the freeze announced itself: ultra laggy system, Xorg and irq-nvidia processes running on a full core each, as usual. System doing OK for the rest.
nvidia-bug-report4.log (1.0 MB)
Not sure if something about what is going on at that time can be seen in there.

I think some of the reports here referencing Xid 61 might be the same as what we were seeing in this thread:

Thank you, indeed it seems to be the same issue.
Not really reassuring that it’s been reported by so many people and that it’s been open for months…

I am still trying to recreate the issue locally but have had no luck so far.
It would be great to know if someone finds reliable steps to reproduce it.
Also, it would be worth updating the BIOS if it is not up to date.