X/NVIDIA freeze (Arch Linux) with 415.25 on a Quadro M1000M

Since a recent Arch update (i.e. new packages, as Arch uses a rolling release model) I get intermittent freezes of the whole X user interface when working in Blender with fairly large 3D models (e.g. 8M triangles). The interface no longer responds to any input, and the screen stops updating (e.g. my CPU monitoring graph). The machine can still be reached remotely through SSH, where I see Xorg taking 100% CPU. When I run nvidia-smi or nvidia-bug-report in that situation, the processes get stuck and cannot be killed (see the kernel messages below).

I attached gdb to the Xorg process and a backtrace shows it to be busy somewhere deep down in the nvidia driver:

(gdb) bt
#0  0x00007f25dff965f4 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#1  0x00007f25dff9671c in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#2  0x00007f25dff2f7da in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#3  0x00007f25e04658e3 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#4  0x40a0000041980000 in  ()
#5  0x40a0000041a00000 in  ()
#6  0x40b2c000419e8000 in  ()
#7  0x40b2c000419e8000 in  ()
#8  0x000056075fa37ec0 in  ()
#9  0x000056075f968028 in  ()
#10 0x0000000000000095 in  ()
#11 0x000056075fa386b0 in  ()
#12 0x0000000000000002 in  ()
#13 0x00007f25e04661d6 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#14 0x00007ffe7bc625c0 in  ()
#15 0x8b532189f6075b00 in  ()
#16 0x0000560700000000 in  ()
#17 0x000056075f812de0 in  ()
#18 0x0000000000000254 in  ()
#19 0x000000000000000b in  ()
#20 0x002700165f227378 in  ()
#21 0x000056075f227378 in  ()
#22 0x000056075f7efae0 in  ()
#23 0x000056075f974670 in  ()
#24 0x0000002703000775 in  ()
#25 0x0153218900000016 in  ()
#26 0x000056075f1cb9c0 in  ()
#27 0x000056075f207120 in  ()
#28 0x0000000000020000 in  ()
#29 0x000056075e49f559 in ValidatePicture ()
#30 0x000056075e49fb3c in CompositeTrapezoids ()
#31 0x000056075e4a2e7d in  ()
#32 0x000056075e3b5fe8 in  ()
#33 0x00007f25e33c7223 in __libc_start_main () at /usr/lib/libc.so.6
#34 0x000056075e3b630e in _start ()
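
For anyone wanting to grab the same kind of backtrace over SSH without an interactive gdb session, here is a minimal Python sketch. The function names `build_gdb_command` and `xorg_backtrace` are made up for illustration; the gdb options used (`-p`, `-batch`, `-ex`) are standard, and it assumes `pidof` and `gdb` are installed and that you have permission to attach to Xorg (typically root).

```python
import subprocess


def build_gdb_command(pid):
    """Build a non-interactive gdb invocation that attaches to `pid`,
    prints a backtrace for every thread, then detaches and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def xorg_backtrace():
    """Return gdb's backtrace output for the first running Xorg process."""
    # pidof prints the PIDs of matching processes; take the first one.
    pid = int(subprocess.check_output(["pidof", "Xorg"], text=True).split()[0])
    result = subprocess.run(build_gdb_command(pid), capture_output=True, text=True)
    return result.stdout
```

Calling `print(xorg_backtrace())` from an SSH session during a freeze should produce output comparable to the trace above. Without debug symbols for the driver, the frames inside nvidia_drv.so will still be unnamed.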

But I see no real error message anywhere, only some libinput-related errors in the Xorg log. Could these cause this issue?

[   173.258] (EE) client bug: timer event8 debounce: offset negative (-97ms)
[   173.258] (EE) client bug: timer event8 debounce short: offset negative (-110ms)

I ran nvidia-bug-report during a freeze, but the log is incomplete; I’m attaching it anyway. I’ll also attach a full log made directly after reboot. Some more info: this is an Optimus system, but I always use the NVIDIA GPU. I’ve been using it for a few years now for all kinds of 3D workloads, and this is the first time I’ve seen a lockup like this.

nvidia-smi getting stuck:

[  614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds.
[  614.266151]       Tainted: P           OE     4.20.1-arch1-1-ARCH #1
[  614.266155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  614.266160] nvidia-smi      D    0  1133   1038 0x00000084
[  614.266166] Call Trace:
[  614.266180]  ? __schedule+0x29b/0x8b0
[  614.266189]  schedule+0x32/0x90
[  614.266194]  schedule_timeout+0x311/0x4a0
[  614.266204]  ? blk_mq_sched_insert_requests+0x80/0xa0
[  614.266210]  ? blk_mq_flush_plug_list+0x221/0x2c0
[  614.266216]  __down+0x7d/0xd0
[  614.266224]  ? _copy_from_user+0x20/0x60
[  614.266231]  down+0x3b/0x50
[  614.266582]  os_acquire_mutex+0x37/0x40 [nvidia]
[  614.266930]  _nv034770rm+0x5c/0x120 [nvidia]
[  614.267260]  ? _nv035933rm+0x17/0x30 [nvidia]
[  614.267816]  ? _nv035934rm+0x6f/0xf0 [nvidia]
[  614.268145]  ? _nv009474rm+0x7c/0xb0 [nvidia]
[  614.268245]  ? _nv034745rm+0xbb/0x1f0 [nvidia]
[  614.268346]  ? _nv034745rm+0x1aa/0x1f0 [nvidia]
[  614.268446]  ? _nv034746rm+0x4f/0x80 [nvidia]
[  614.268549]  ? _nv008191rm+0x43/0x60 [nvidia]
[  614.268696]  ? _nv001095rm+0x51e/0x850 [nvidia]
[  614.268699]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[  614.268845]  ? rm_ioctl+0x73/0x100 [nvidia]
[  614.268848]  ? __kmalloc+0xb1/0x220
[  614.268939]  ? nvidia_ioctl+0x5f6/0x770 [nvidia]
[  614.268942]  ? preempt_count_add+0x79/0xb0
[  614.269031]  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[  614.269034]  ? do_vfs_ioctl+0xa4/0x630
[  614.269037]  ? syscall_trace_enter+0x1d3/0x2d0
[  614.269039]  ? ksys_ioctl+0x60/0x90
[  614.269041]  ? __x64_sys_ioctl+0x16/0x20
[  614.269043]  ? do_syscall_64+0x5b/0x170
[  614.269045]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
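
To check quickly whether other processes end up blocked in the same way, the hung-task reports can be pulled out of the kernel log with a few lines of Python. This is only a sketch: the helper name `find_hung_tasks` is hypothetical, and the regular expression is written against the message format shown in the trace above.

```python
import re

# Matches kernel log lines such as:
# [  614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (timestamp, task name, pid, seconds) for each hung-task report."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK.search(line)
        if m:
            hits.append((float(m["ts"]), m["name"], int(m["pid"]), int(m["secs"])))
    return hits


sample = "[  614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds."
print(find_hung_tasks(sample))
```

Feeding it the text output of `dmesg` gives one tuple per report, which makes it easy to see whether only nvidia-smi is stuck on the driver mutex or whether other tasks are affected too.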

nvidia-bug-report.log.gz (1.02 MB)
nvidia-bug-report-during-freeze.log.gz (59.4 KB)

Hmmm, the driver download page on nvidia.com suggests 410.93 is the last available driver for this GPU. Checking my OS logs, it seems the previous driver I was using was 410.57.2. I guess I’ll try a driver downgrade.
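
For reference, on Arch the previously installed driver version can be looked up in pacman's log (by default /var/log/pacman.log). Below is a rough Python sketch of that lookup. The helper name `upgrade_history` is hypothetical, and the sample line is illustrative only: its timestamp and the `-1` release suffix are made up.

```python
import re

# pacman's ALPM log entries for upgrades look like:
#   [<timestamp>] [ALPM] upgraded <package> (<old version> -> <new version>)
UPGRADE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)$"
)


def upgrade_history(log_text, package):
    """Return (timestamp, old version, new version) for each upgrade of `package`."""
    history = []
    for line in log_text.splitlines():
        m = UPGRADE.match(line.strip())
        if m and m["pkg"] == package:
            history.append((m["when"], m["old"], m["new"]))
    return history


# Illustrative line only: the timestamp and the "-1" suffix are invented.
sample = "[2019-01-10 09:00] [ALPM] upgraded nvidia (410.57-2 -> 415.25-1)\n"
print(upgrade_history(sample, "nvidia"))
```

On a real system the call would be something like `upgrade_history(open("/var/log/pacman.log").read(), "nvidia")`. Running it for both the driver and kernel packages shows which versions were last installed together.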

Sigh, downgrading the drivers to 410.57.2 would mean downgrading the kernel as well.

Maybe use the 4.19 LTS kernel (linux-lts) instead of always upgrading to the latest. That would save you the hassle of the NVIDIA driver breaking.
https://www.archlinux.org/packages/core/x86_64/linux-lts/

Unfortunately, the latest nvidia-lts is also based on 415.25.

Oh wait, I should downgrade nvidia-lts, of course

Ok, so that doesn’t work any better, as downgrading to nvidia-lts 410.57-2 means downgrading to a 4.14 kernel as well. So what’s the point of using the LTS version then?

So with Blender 2.8 there are now GPU errors in the X log:

[   261.071] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000348c, 0x00003494)
[   268.071] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x0000348c, 0x00003494)
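
To separate the driver errors from the libinput noise mentioned earlier, the `(EE)` lines in the Xorg log can be filtered by prefix. A minimal Python sketch follows; the helper name `xorg_errors` is hypothetical, and the sample lines are copied from the log excerpts in this thread.

```python
import re

# Xorg log error lines look like:
# [   261.071] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000348c, 0x00003494)
ERROR_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\] \(EE\) (?P<msg>.*)$")


def xorg_errors(log_text):
    """Return (timestamp, message) for every (EE) line in an Xorg log."""
    errors = []
    for line in log_text.splitlines():
        m = ERROR_LINE.match(line)
        if m:
            errors.append((float(m["ts"]), m["msg"]))
    return errors


sample = """\
[   173.258] (EE) client bug: timer event8 debounce: offset negative (-97ms)
[   261.071] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000348c, 0x00003494)
[   268.071] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x0000348c, 0x00003494)
"""

# Keep only the errors reported by the NVIDIA driver.
nvidia_only = [e for e in xorg_errors(sample) if e[1].startswith("NVIDIA")]
print(nvidia_only)
```

The timestamps are seconds since the X server started, so comparing them against the moment of a freeze helps tell whether the WAIT errors coincide with the lockup.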

Hmm, this is getting weird. I went back to nvidia-390.xx, specifically the 390.87 driver, but under the 4.20.1 kernel. I now get the same freeze. So perhaps this has more to do with the kernel version than the actual NVIDIA driver.