I’ve had this error sporadically under the 470.94 driver. Previous reports describe it happening while the GPU is heavily loaded, possibly due to power supply problems. In contrast, I’ve only encountered it after the machine had been idle for 6+ hours, and only while both attached monitors were manually switched off. I have not yet had this fault while actively using the system, or with the monitors in an automatic sleep mode, so I wonder whether it might be a DDC issue (e.g. the driver falling over when DDC is down for long periods). I just downloaded 470.103 and will see whether that changes anything, although I don’t see any related bugfixes in the release notes.
When the error occurs, the displays are rendered permanently inactive (until reboot) and the kernel message log shows the following:
[657106.751771] NVRM: GPU at PCI:0000:28:00: GPU-a60c1060-54c9-b922-7733-c717fdc14af5
[657106.751779] NVRM: Xid (PCI:0000:28:00): 79, pid=1907, GPU has fallen off the bus.
[657106.751786] NVRM: GPU 0000:28:00.0: GPU has fallen off the bus.
[657106.751802] NVRM: GPU 0000:28:00.0: GPU serial number is .
[657106.751822] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[657444.372680] GpuWatchdog[6269]: segfault at 0 ip 00007fc8ad762b5f sp 00007fc8a47bd460 error 6 in libcef.so[7fc8a9528000+6f56000]
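For anyone comparing their own logs against this one, here is a minimal Python sketch that pulls Xid events out of kernel log text and converts the bracketed timestamp to days. It assumes the timestamp is seconds since boot (the usual dmesg format) and that Xid lines look like the one above; the names `XID_RE`, `parse_xid_events` and `uptime_days` are made up for this illustration and are not part of any NVIDIA tool.

```python
import re

# Matches NVRM Xid lines in the format shown in the log above.
# Assumption: the bracketed number is seconds since boot.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>[^)]+)\): "
    r"(?P<xid>\d+),\s*(?P<msg>.*)"
)


def parse_xid_events(log_text):
    """Return one dict per Xid line found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append({
                "uptime_s": float(m["ts"]),
                "device": m["dev"],
                "xid": int(m["xid"]),
                "message": m["msg"].strip(),
            })
    return events


def uptime_days(seconds):
    """Convert seconds since boot to days."""
    return seconds / 86400.0


# Lines copied from the log above.
SAMPLE = """\
[657106.751771] NVRM: GPU at PCI:0000:28:00: GPU-a60c1060-54c9-b922-7733-c717fdc14af5
[657106.751779] NVRM: Xid (PCI:0000:28:00): 79, pid=1907, GPU has fallen off the bus.
[657106.751786] NVRM: GPU 0000:28:00.0: GPU has fallen off the bus.
"""

events = parse_xid_events(SAMPLE)
for e in events:
    print(f"Xid {e['xid']} on {e['device']} after "
          f"{uptime_days(e['uptime_s']):.1f} days: {e['message']}")
```

Run against the excerpt, this reports a single Xid 79 event roughly 7.6 days after boot. To scan a saved log instead, read the file into a string and pass it to `parse_xid_events`.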
You can see from the timestamps that I had plenty of stable uptime before the fault. I can log into the machine via SSH and manually unload the nvidia kernel modules, but when I attempt another modprobe nvidia, I get a long stream of
[662558.412133] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000987d:0:0:0x0000000f
(repeated many times) and then eventually a kernel warning in nvidia_dev_put():
[663149.792091] [drm] [nvidia-drm] [GPU ID 0x00002800] Unloading driver
[663149.793553] ------------[ cut here ]------------
[663149.793555] WARNING: CPU: 13 PID: 1227 at /tmp/selfgz7248/NVIDIA-Linux-x86_64-470.94/kernel/nvidia/nv.c:4757 nvidia_dev_put+0x92/0xa0 [nvidia]
[663149.793974] CPU: 13 PID: 1227 Comm: rmmod Tainted: P O 5.15.18 #4
[663149.793980] RIP: 0010:nvidia_dev_put+0x92/0xa0 [nvidia]
Full nvidia-bug-report.sh logs attached. You’ll note in the dmesg log that I had booted the kernel with the command-line option rcutree.rcu_idle_gp_delay=1, which was suggested in another forum as a folk remedy for these errors but obviously doesn’t help. My system hangs on boot when I add acpi_osi=! acpi_osi='Linux', another frequently suggested folk remedy.
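When trying boot options like these, it can help to confirm which ones the running kernel actually received. The following is a rough Python sketch that parses a command line of the kind found in /proc/cmdline. The helper names and the sample string are hypothetical (in particular the `BOOT_IMAGE` and `root` values are placeholders), and the parser simply splits on whitespace, so it does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter name to the list of values it was given.

    Parameters may repeat (e.g. acpi_osi given twice), so values are kept
    in a list. Flags without '=' are stored with a value of None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical sample; only the rcutree option comes from the post above.
SAMPLE = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro rcutree.rcu_idle_gp_delay=1"

params = parse_cmdline(SAMPLE)
for name in ("rcutree.rcu_idle_gp_delay", "acpi_osi"):
    if name in params:
        print(f"{name} is set: {params[name]}")
    else:
        print(f"{name} is not set")
```

On a live system, `parse_cmdline(read_cmdline())` would check the options the kernel was actually booted with.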
I don’t recall ever seeing these errors when I was running the 460.x series drivers, although those had other stability problems that 470.x generally seems to improve upon.
nvidia-bug-report.log.gz (1.1 MB)