Driver 470.103.01 hard freezes Ubuntu 20.04.4 with GPU reset

I'm having very similar problems to those reported by others in at least three other threads on this site, which the interface won't let me link to because I'm a "new user". At least one of those threads tries to be helpful.

The symptoms include multiple (6-10) daily hard GUI freezes at random times, not obviously related to any particular user input or activity.

The logs reveal several of the same symptoms, including Xorg hangs.

A common pattern is the message:

Ignored exception from dbus method: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name com.gonzaarcr.appmenu was not provided by any .service files

followed by several dozen to several hundred pulseaudio messages as it loops, complaining about latency issues. The message loop looks like:

Apr 21 12:59:26 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= max_request changed, trying to update from 2340 to 3222.
Apr 21 12:59:26 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= max_request changed, trying to update from 2340 to 3222.
Apr 21 12:59:26 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= max_request changed, trying to update from 2340 to 3222.
Apr 21 12:59:26 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= Notifying client about increased tlength
Apr 21 12:59:26 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= hwbuf_unused=346452

The loop concludes with these two messages, shortly or immediately before the message indicating a reboot:

Apr 21 12:59:33 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= Latency set to 46.00ms
Apr 21 12:59:33 | user.debug | greystone | PN=pulseaudio | ST=pulseaudio[1965]: | AN=pulseaudio | MSG= setting avail_min=87055

Other common messages are:

Apr 23 15:53:43 | user.warning | greystone | PN= | ST=/usr/lib/gdm3/gdm-x-session[2002]: | AN=- | MSG= (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000f884, 0x0000cc64)

Apr 21 14:57:27 | daemon.debug | greystone | PN=rtkit-daemon | ST=rtkit-daemon[1606]: | AN=rtkit-daemon | MSG= Supervising 5 threads of 3 processes of 2 users.

Apr 21 14:58:03 | user.warning | greystone | PN= | ST=/usr/lib/gdm3/gdm-x-session[2101]: | AN=- | MSG= (WW) NVIDIA: Wait for channel idle timed out.

Apr 23 09:49:01 | daemon.info | greystone | PN=gsd-media-keys | ST=gsd-media-keys[3943]: | AN=gsd-media-keys | MSG= [GFX1]: Device reset due to WR context

The reboot after the last entry produced thousands, if not tens of thousands, of pulseaudio messages looping through setup.

I should add that, during this, no sound is playing.

What I think is going on is either (a) the card’s HDMI-related DSP is preventing pulseaudio from connecting to the default sound device, which is part of the motherboard chipset, or (b) because nvidiafb is not compiled into the kernel, the card’s memory management fails.
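If it would help rule out (a), the simplest check I can think of is to ask PulseAudio and ALSA what they see. This is just a sketch using the standard tools; nothing here is specific to my setup:

# Which sinks PulseAudio sees, and which one is the default
pactl list short sinks
pactl info | grep "Default Sink"

# Whether ALSA sees both the NVIDIA HDMI device and the onboard chipset
aplay -l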

I also should note that I’ve tried setting nvidia-drm.modeset=1 nouveau.modeset=0 and the periodic hangs/waits/resets persist.
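For completeness, this is roughly how I set those parameters, assuming the standard GRUB route on Ubuntu, and how I confirmed they took effect after reboot:

# /etc/default/grub, then: sudo update-grub && reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia-drm.modeset=1 nouveau.modeset=0"

# Confirm after reboot
cat /proc/cmdline
cat /sys/module/nvidia_drm/parameters/modeset   # should print Y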
nvidia-bug-report.log.gz (270.4 KB)

Any thoughts or assistance would be most welcome. I'm on day 12 of an Ubuntu 20.04 reinstall, I'm wearing out my welcome at Google, and I'd like to be able to get back to work.

You're always running into an XID 62, so I guess your NVIDIA GPU is beginning to break.

Many thanks. If that's the case, that's incredibly helpful and would explain much.

I'll have another look at the logs, but tell me what you're looking at. IIRC, the card's IRQ in its current slot is also 62, and I want to make sure I'm grepping the right string.

Also, I gather that an XID is an error code from Xorg, the driver, or the firmware. If so, are these enumerated anywhere? I haven't seen any references to them and obviously would have pursued this angle if I had. It would be interesting to know whether there are others and, if so, what they mean.

XIDs are NVIDIA driver/GPU error codes:
https://docs.nvidia.com/deploy/xid-errors/index.html

They're visible in dmesg after a crash occurs, or check the journal after a reboot:
sudo journalctl -b -1 | grep -i XID

Edit: case insensitivity added.

First of all, thanks. This is hugely illuminating.

Second, and more of an aside: none of this is conspicuous, meaning easy to find. Maybe someone could throw together a "Troubleshooting NVIDIA Linux Driver Issues" page that's tagged so the search engines pick it up, and mention it here.

Third, and to the point, I had a look at syslog, which picks up dmesg, and the XID 62 messages correlate directly with every freeze I had yesterday.

So far, so good, but each of these is preceded by several messages, typically five, just like this one:

13<1>Apr 24 13:35:13 | user.notice | PN=gnome-shell | ST=gnome-shell[4137]: | AN=gnome-shell | MSG= Ignored exception from dbus method: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name com.gonzaarcr.appmenu was not provided by any .service files
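For anyone following along, this is roughly how I'm pulling the XID lines together with the messages that precede them (the -a is needed because of the null characters I mention below):

grep -a -i -B 5 'xid' /var/log/syslog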

The XID reference lists the likely causes of XID 62 as hardware (as you suggested), but also the driver or a thermal issue. I'm pretty sure it isn't the latter (the physical configuration is fine and the temperature readings are normal). That leaves the driver, and a few things probably should be ruled out.

I flagged each of these "gonzaarcr" gnome-shell messages before but don't know what to make of them. They precede every XID 62 message, which suggests they may be triggering the XID 62 event. Do you have any idea what these might be? It would be interesting to disable whatever is creating them to see whether the freezes and XID 62s continue.
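If it turns out to be a gnome-shell extension, I assume disabling it would look something like this; the UUID below is a placeholder, and I'd get the real one from the list command first:

gnome-extensions list --enabled
gnome-extensions disable <uuid-from-the-list-above>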

It's also curious that something is injecting null characters into the log file in these time ranges. I'm not certain yet which message is responsible or even whether it matters, but I have to add the grep -a option to get any output when I pipe the log through grep. The log messages themselves may not matter, but the same processes could just as easily be injecting null characters elsewhere, which may matter a great deal.
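A crude way to confirm and locate the null bytes (this just counts them and lists the lines that contain them; I'm not claiming it identifies what wrote them):

# Count NUL bytes in the log
tr -cd '\0' < /var/log/syslog | wc -c

# List the line numbers that contain them
grep -a -n -P '\x00' /var/log/syslog | head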

I also wonder whether it matters that nvidiafb is not compiled into the kernel, for reasons unknown (to me, at least).
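For what it's worth, this is how I'm checking that; I believe the relevant config symbol is FB_NVIDIA, but I could be wrong about the exact name:

# Is nvidiafb available as a module at all?
modinfo nvidiafb

# Is it enabled in the stock kernel config?
grep -i fb_nvidia /boot/config-$(uname -r)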

Another matter is that the post-reboot messages include:

12<1>Apr 24 15:38:51 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[1584]: | AN=- | MSG= (WW) NVIDIA: ‘/var/run/nvidia-xdriver-587226bc’ Permission denied
12<1>Apr 24 15:38:53 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[305]: | AN=- | MSG= /usr/bin/prime-supported: 38: cannot create /var/log/prime-supported.log: Permission denied
12<1>Apr 24 15:38:53 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[305]: | AN=- | MSG= /sbin/prime-offload: 29: cannot create /var/log/prime-offload.log: Permission denied
12<1>Apr 24 15:39:22 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[2070]: | AN=- | MSG= (WW) NVIDIA: ‘/var/run/nvidia-xdriver-1b21b4a5’ Permission denied
12<1>Apr 24 15:39:22 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[305]: | AN=- | MSG= /usr/bin/prime-supported: 38: cannot create /var/log/prime-supported.log: Permission denied
12<1>Apr 24 15:39:22 | user.warning | PN= | ST=/usr/lib/gdm3/gdm-x-session[305]: | AN=- | MSG= /sbin/prime-offload: 29: cannot create /var/log/prime-offload.log: Permission denied

So either the driver or gdm-x-session is executing at least some offload-related tasks with insufficient permissions. Whether this is a big deal or a small one, I can't say. No other prime-related log messages exist, but I also don't know what that might indicate.
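In case it's relevant, the prime state and the permissions on those paths can be checked directly; this assumes Ubuntu's nvidia-prime package, which provides prime-select:

prime-select query
ls -l /var/log/prime-supported.log /var/log/prime-offload.log
ls -ld /var/run/nvidia-xdriver-*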

So, several leads on what might be afoot. I’m keenly interested in your take on them, and thanks again for your responsiveness thus far.

I have also seen some comments suggesting that Adaptive Clocking may sometimes downclock too quickly, pushing rendering to the CPU if the GPU can't clock back up fast enough to beat the timeout when a rendering demand arrives.

I previously observed, by chance, that these GUI freezes occurred when Xorg locked up one of the CPU cores, although I haven't been able to catch this in the act recently. This behavior is consistent with those comments.

These CPU lockups ceased when I configured PowerMizer to "Prefer Maximum Performance" (mode 1), which increased the PCIe link from x1 to x16 and raised the GPU frequency ceiling. This seems to indicate that the setting provides greater and more readily available offload capacity.
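For reference, the same change can be made and verified from the command line; a sketch, assuming GPU 0:

# 1 = Prefer Maximum Performance
nvidia-settings -a '[gpu:0]/GpuPowerMizerMode=1'

# Confirm the mode, the PCIe link width, and the graphics clock
nvidia-settings -q '[gpu:0]/GpuPowerMizerMode'
nvidia-smi --query-gpu=pcie.link.width.current,clocks.gr --format=csv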

I wonder whether the XID 62 (internal micro-controller halt) isn't tied to Adaptive Clocking, and whether the previously mentioned failures to start prime offload could be preventing the internal micro-controller from starting.

/var/log/prime-offload.log: Permission denied
Known bug in Ubuntu since X is running rootless; doesn't matter.
com.gonzaarcr.appmenu
gnome-shell plugin
https://github.com/gonzaarcr/Fildem
Irrelevant.