Since a recent Arch update (i.e. new packages, as Arch uses a rolling release model), I get intermittent freezes of the whole X user interface when working in Blender with fairly large 3D models (around 8M triangles). The interface no longer responds to any input, and the screen stops updating (e.g. my CPU monitoring graph freezes). The machine can still be reached remotely through SSH, where I see Xorg taking 100% CPU. When I run nvidia-smi or nvidia-bug-report in that situation, those processes get stuck and cannot be killed (see the kernel messages below).
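A process that cannot be killed is usually in uninterruptible sleep (state D), which is also the state the kernel reports for nvidia-smi in the messages further down. This can be checked over SSH by reading the state field of /proc/<pid>/stat. A minimal Python sketch, assuming a Linux /proc filesystem; the function names parse_stat and uninterruptible_tasks are made up for illustration:

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen])
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stuck  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while scanning, or unreadable entry
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling uninterruptible_tasks() during a freeze should list the stuck nvidia-smi or nvidia-bug-report processes if they are blocked in the kernel. Xorg itself is spinning at 100% CPU, so it would be expected to show up as running (R) rather than D.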
I attached gdb to the Xorg process and a backtrace shows it to be busy somewhere deep down in the nvidia driver:
(gdb) bt
#0 0x00007f25dff965f4 in () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#1 0x00007f25dff9671c in () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#2 0x00007f25dff2f7da in () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#3 0x00007f25e04658e3 in () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#4 0x40a0000041980000 in ()
#5 0x40a0000041a00000 in ()
#6 0x40b2c000419e8000 in ()
#7 0x40b2c000419e8000 in ()
#8 0x000056075fa37ec0 in ()
#9 0x000056075f968028 in ()
#10 0x0000000000000095 in ()
#11 0x000056075fa386b0 in ()
#12 0x0000000000000002 in ()
#13 0x00007f25e04661d6 in () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#14 0x00007ffe7bc625c0 in ()
#15 0x8b532189f6075b00 in ()
#16 0x0000560700000000 in ()
#17 0x000056075f812de0 in ()
#18 0x0000000000000254 in ()
#19 0x000000000000000b in ()
#20 0x002700165f227378 in ()
#21 0x000056075f227378 in ()
#22 0x000056075f7efae0 in ()
#23 0x000056075f974670 in ()
#24 0x0000002703000775 in ()
#25 0x0153218900000016 in ()
#26 0x000056075f1cb9c0 in ()
#27 0x000056075f207120 in ()
#28 0x0000000000020000 in ()
#29 0x000056075e49f559 in ValidatePicture ()
#30 0x000056075e49fb3c in CompositeTrapezoids ()
#31 0x000056075e4a2e7d in ()
#32 0x000056075e3b5fe8 in ()
#33 0x00007f25e33c7223 in __libc_start_main () at /usr/lib/libc.so.6
#34 0x000056075e3b630e in _start ()
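To see at a glance how much of such a backtrace resolves to a known object, the frames can be grouped by the file gdb prints after "at". A rough Python sketch, assuming the frame format shown above; FRAME_RE and summarize_backtrace are illustrative names, not part of gdb:

```python
import re
from collections import Counter

# Matches frames such as:
#   #0 0x00007f25dff965f4 in () at /path/to/lib.so
#   #29 0x000056075e49f559 in ValidatePicture ()
FRAME_RE = re.compile(
    r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (?:(\S+) )?\(\)(?: at (\S+))?"
)

def summarize_backtrace(text):
    """Count frames per object file; frames without one are bucketed."""
    counts = Counter()
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue  # not a frame line
        _, _, symbol, obj = m.groups()
        if obj:
            counts[obj] += 1
        elif symbol and symbol != "??":
            counts["(symbol only)"] += 1
        else:
            counts["(unresolved)"] += 1
    return counts
```

Many of the frames here have neither a symbol nor an object file, so only the frames that do carry a name or path say anything about where Xorg is spending its time.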
I see no real error message anywhere, only some libinput-related errors in the Xorg log. Could these cause this issue?
[ 173.258] (EE) client bug: timer event8 debounce: offset negative (-97ms)
[ 173.258] (EE) client bug: timer event8 debounce short: offset negative (-110ms)
I ran nvidia-bug-report during a freeze, but the log is incomplete; I’m attaching it anyway. I’ll also attach a full log made directly after a reboot. Some more info: this is an Optimus system, but I always use the NVIDIA GPU. I’ve been using it for a few years now for all kinds of 3D workloads, and this is the first time I’ve seen a lockup like this.
Kernel messages for nvidia-smi getting stuck:
[ 614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds.
[ 614.266151] Tainted: P OE 4.20.1-arch1-1-ARCH #1
[ 614.266155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 614.266160] nvidia-smi D 0 1133 1038 0x00000084
[ 614.266166] Call Trace:
[ 614.266180] ? __schedule+0x29b/0x8b0
[ 614.266189] schedule+0x32/0x90
[ 614.266194] schedule_timeout+0x311/0x4a0
[ 614.266204] ? blk_mq_sched_insert_requests+0x80/0xa0
[ 614.266210] ? blk_mq_flush_plug_list+0x221/0x2c0
[ 614.266216] __down+0x7d/0xd0
[ 614.266224] ? _copy_from_user+0x20/0x60
[ 614.266231] down+0x3b/0x50
[ 614.266582] os_acquire_mutex+0x37/0x40 [nvidia]
[ 614.266930] _nv034770rm+0x5c/0x120 [nvidia]
[ 614.267260] ? _nv035933rm+0x17/0x30 [nvidia]
[ 614.267816] ? _nv035934rm+0x6f/0xf0 [nvidia]
[ 614.268145] ? _nv009474rm+0x7c/0xb0 [nvidia]
[ 614.268245] ? _nv034745rm+0xbb/0x1f0 [nvidia]
[ 614.268346] ? _nv034745rm+0x1aa/0x1f0 [nvidia]
[ 614.268446] ? _nv034746rm+0x4f/0x80 [nvidia]
[ 614.268549] ? _nv008191rm+0x43/0x60 [nvidia]
[ 614.268696] ? _nv001095rm+0x51e/0x850 [nvidia]
[ 614.268699] ? _raw_spin_unlock_irqrestore+0x20/0x40
[ 614.268845] ? rm_ioctl+0x73/0x100 [nvidia]
[ 614.268848] ? __kmalloc+0xb1/0x220
[ 614.268939] ? nvidia_ioctl+0x5f6/0x770 [nvidia]
[ 614.268942] ? preempt_count_add+0x79/0xb0
[ 614.269031] ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[ 614.269034] ? do_vfs_ioctl+0xa4/0x630
[ 614.269037] ? syscall_trace_enter+0x1d3/0x2d0
[ 614.269039] ? ksys_ioctl+0x60/0x90
[ 614.269041] ? __x64_sys_ioctl+0x16/0x20
[ 614.269043] ? do_syscall_64+0x5b/0x170
[ 614.269045] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
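Hung-task reports like the one above can be pulled out of a saved kernel log (e.g. dmesg output) with a small filter. A minimal Python sketch, assuming the "INFO: task NAME:PID blocked for more than N seconds" wording shown above; HUNG_RE and find_hung_tasks are hypothetical names:

```python
import re

# Matches e.g.:
#   [ 614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)

def find_hung_tasks(log_text):
    """Return one dict per hung-task report found in the kernel log text."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            ts, name, pid, secs = m.groups()
            hits.append({
                "time": float(ts),       # seconds since boot
                "task": name,
                "pid": int(pid),
                "blocked_s": int(secs),  # the hung_task timeout that was exceeded
            })
    return hits
```

Running this over the log of a frozen session shows which processes, besides nvidia-smi, ended up waiting on the same driver lock and when each report appeared.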
nvidia-bug-report.log.gz (1.02 MB)
nvidia-bug-report-during-freeze.log.gz (59.4 KB)