X/NVIDIA freeze (Arch Linux) with 415.25 on a Quadro M1000M

paulmelis · January 14, 2019, 11:09am

Since a recent Arch update (i.e. new packages, as Arch uses a rolling update model) I get intermittent freezes of the whole X user interface when working in Blender with somewhat large 3D models (e.g. 8M triangles). The interface no longer responds to any input and the screen stops updating, e.g. my CPU monitoring graph stands still. The machine can still be reached remotely through SSH, and there I see Xorg taking 100% CPU. When I run nvidia-smi or nvidia-bug-report in that situation, the processes get stuck with no way of killing them (see the kernel messages below).
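
A side note on why the stuck processes cannot be killed: a task in uninterruptible sleep (state D, as shown for nvidia-smi in the kernel messages further down) generally does not act on signals until the kernel call it is blocked in returns. Below is a minimal sketch, assuming a Linux /proc filesystem, that lists such tasks from an SSH session. The helper names parse_stat_line and list_d_state are made up for illustration and are not part of any NVIDIA or Arch tool.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def list_d_state(proc_root="/proc"):
    """List (pid, comm) for every task currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state = parse_stat_line(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in list_d_state():
        print(pid, comm)
```

Run during a freeze, this would be expected to list the hung nvidia-smi or nvidia-bug-report processes, though that is an assumption and not something verified in this thread.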

I attached gdb to the Xorg process and a backtrace shows it to be busy somewhere deep down in the nvidia driver:

(gdb) bt
#0  0x00007f25dff965f4 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#1  0x00007f25dff9671c in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#2  0x00007f25dff2f7da in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#3  0x00007f25e04658e3 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#4  0x40a0000041980000 in  ()
#5  0x40a0000041a00000 in  ()
#6  0x40b2c000419e8000 in  ()
#7  0x40b2c000419e8000 in  ()
#8  0x000056075fa37ec0 in  ()
#9  0x000056075f968028 in  ()
#10 0x0000000000000095 in  ()
#11 0x000056075fa386b0 in  ()
#12 0x0000000000000002 in  ()
#13 0x00007f25e04661d6 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#14 0x00007ffe7bc625c0 in  ()
#15 0x8b532189f6075b00 in  ()
#16 0x0000560700000000 in  ()
#17 0x000056075f812de0 in  ()
#18 0x0000000000000254 in  ()
#19 0x000000000000000b in  ()
#20 0x002700165f227378 in  ()
#21 0x000056075f227378 in  ()
#22 0x000056075f7efae0 in  ()
#23 0x000056075f974670 in  ()
#24 0x0000002703000775 in  ()
#25 0x0153218900000016 in  ()
#26 0x000056075f1cb9c0 in  ()
#27 0x000056075f207120 in  ()
#28 0x0000000000020000 in  ()
#29 0x000056075e49f559 in ValidatePicture ()
#30 0x000056075e49fb3c in CompositeTrapezoids ()
#31 0x000056075e4a2e7d in  ()
#32 0x000056075e3b5fe8 in  ()
#33 0x00007f25e33c7223 in __libc_start_main () at /usr/lib/libc.so.6
#34 0x000056075e3b630e in _start ()
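
For readers who want to triage a trace like this quickly, here is a rough sketch that counts how many frames fall inside nvidia_drv.so and which frames carry a symbol name. It assumes the plain "bt" output format shown above; the function name summarize_backtrace is invented for this example.

```python
import re

# Matches gdb "bt" lines such as:
#   #0  0x00007f25dff965f4 in  () at /usr/lib/xorg/modules/drivers/nvidia_drv.so
#   #29 0x000056075e49f559 in ValidatePicture ()
FRAME_RE = re.compile(
    r"^#(?P<idx>\d+)\s+0x(?P<addr>[0-9a-f]+) in\s+(?P<func>[^\s(]*)\s*\(\)"
    r"(?:\s+at\s+(?P<obj>\S+))?"
)


def summarize_backtrace(bt_text):
    """Count frames, frames inside nvidia_drv.so, and collect named symbols."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(m.groupdict())
    in_driver = [
        f for f in frames if f["obj"] and f["obj"].endswith("nvidia_drv.so")
    ]
    named = [f["func"] for f in frames if f["func"]]
    return {
        "total": len(frames),
        "in_nvidia_drv": len(in_driver),
        "named": named,
    }
```

Frames without a symbol or object file may not be real return addresses, since gdb has little to unwind with inside a stripped binary driver, so a summary like this is a hint about where Xorg is spinning and not a reliable call chain.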

But I see no real error message anywhere, only some libinput-related errors in the Xorg log. Could these cause this issue?

[   173.258] (EE) client bug: timer event8 debounce: offset negative (-97ms)
[   173.258] (EE) client bug: timer event8 debounce short: offset negative (-110ms)

I ran nvidia-bug-report during a freeze, but the log is incomplete; I’m attaching it anyway. I’ll also attach a full log made directly after reboot. Some more info: this is an Optimus system, but I always use the NVIDIA GPU. I’ve been using it for a few years now for all kinds of 3D workloads, and this is the first time I’ve seen a lockup like this.

nvidia-smi getting stuck:

[  614.266140] INFO: task nvidia-smi:1133 blocked for more than 120 seconds.
[  614.266151]       Tainted: P           OE     4.20.1-arch1-1-ARCH #1
[  614.266155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  614.266160] nvidia-smi      D    0  1133   1038 0x00000084
[  614.266166] Call Trace:
[  614.266180]  ? __schedule+0x29b/0x8b0
[  614.266189]  schedule+0x32/0x90
[  614.266194]  schedule_timeout+0x311/0x4a0
[  614.266204]  ? blk_mq_sched_insert_requests+0x80/0xa0
[  614.266210]  ? blk_mq_flush_plug_list+0x221/0x2c0
[  614.266216]  __down+0x7d/0xd0
[  614.266224]  ? _copy_from_user+0x20/0x60
[  614.266231]  down+0x3b/0x50
[  614.266582]  os_acquire_mutex+0x37/0x40 [nvidia]
[  614.266930]  _nv034770rm+0x5c/0x120 [nvidia]
[  614.267260]  ? _nv035933rm+0x17/0x30 [nvidia]
[  614.267816]  ? _nv035934rm+0x6f/0xf0 [nvidia]
[  614.268145]  ? _nv009474rm+0x7c/0xb0 [nvidia]
[  614.268245]  ? _nv034745rm+0xbb/0x1f0 [nvidia]
[  614.268346]  ? _nv034745rm+0x1aa/0x1f0 [nvidia]
[  614.268446]  ? _nv034746rm+0x4f/0x80 [nvidia]
[  614.268549]  ? _nv008191rm+0x43/0x60 [nvidia]
[  614.268696]  ? _nv001095rm+0x51e/0x850 [nvidia]
[  614.268699]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[  614.268845]  ? rm_ioctl+0x73/0x100 [nvidia]
[  614.268848]  ? __kmalloc+0xb1/0x220
[  614.268939]  ? nvidia_ioctl+0x5f6/0x770 [nvidia]
[  614.268942]  ? preempt_count_add+0x79/0xb0
[  614.269031]  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[  614.269034]  ? do_vfs_ioctl+0xa4/0x630
[  614.269037]  ? syscall_trace_enter+0x1d3/0x2d0
[  614.269039]  ? ksys_ioctl+0x60/0x90
[  614.269041]  ? __x64_sys_ioctl+0x16/0x20
[  614.269043]  ? do_syscall_64+0x5b/0x170
[  614.269045]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
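
A note on reading this trace: in kernel stack dumps, entries prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. The unprefixed entries here (schedule, schedule_timeout, __down, down, os_acquire_mutex, _nv034770rm) suggest nvidia-smi is waiting on a lock inside the nvidia module. Below is a small sketch that splits a hung-task report into those two groups. It assumes the dmesg line format shown above, and parse_hung_task is a hypothetical helper name.

```python
import re

HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
# A frame looks like "]  ? symbol+0xOFF/0xSIZE [module]"; "?" and module are optional.
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def parse_hung_task(dmesg_text):
    """Return the last hung-task report in the text, or None if there is none."""
    report = None
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            report = {
                "comm": hung["comm"],
                "pid": int(hung["pid"]),
                "secs": int(hung["secs"]),
                "reliable": [],  # frames without the "?" marker
                "guessed": [],   # frames the unwinder was unsure about
            }
            continue
        frame = FRAME_RE.search(line) if report else None
        if frame:
            key = "guessed" if frame["guess"] else "reliable"
            report[key].append((frame["sym"], frame["mod"]))
    return report
```

The last entry in "reliable" that belongs to the nvidia module is a reasonable first thing to quote in a bug report, although the _nvNNNNNNrm names are obfuscated and only meaningful to NVIDIA.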

nvidia-bug-report.log.gz (1.02 MB)
nvidia-bug-report-during-freeze.log.gz (59.4 KB)

paulmelis · January 14, 2019, 11:14am

Hmmm, the driver download page on nvidia.com suggests 410.93 is the latest available driver for this GPU. Checking my OS logs, it seems the previous driver I was using was 410.57.2. I guess I’ll try a driver downgrade.

paulmelis · January 14, 2019, 11:26am

Sigh, downgrading the drivers to 410.57.2 would mean downgrading the kernel as well.

generix · January 14, 2019, 11:43am

Maybe use the 4.19 LTS kernel (linux-lts) instead of always upgrading to the latest. That would save you the hassle of the nvidia driver breaking.
https://www.archlinux.org/packages/core/x86_64/linux-lts/

paulmelis · January 14, 2019, 1:02pm

Unfortunately, the latest nvidia-lts is also based on 415.25.

paulmelis · January 14, 2019, 1:10pm

Oh wait, I should downgrade nvidia-lts, of course

paulmelis · January 14, 2019, 1:16pm

Ok, so that doesn’t work any better, as downgrading to nvidia-lts 410.57-2 means downgrading to a 4.14 kernel as well. So what’s the point of using the LTS version then?

paulmelis · January 14, 2019, 2:09pm

So with Blender 2.8 there are now GPU errors in the X log:

[   261.071] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000348c, 0x00003494)
[   268.071] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x0000348c, 0x00003494)
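
To check whether these WAIT lines line up with the freezes, a short sketch that pulls them out of an Xorg log together with their timestamps may help. It assumes the line format shown above. The meaning of the values in parentheses is not documented in this thread, so they are kept as raw strings, and find_wait_errors is a name made up for this example.

```python
import re

# Matches lines such as:
#   [   261.071] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000348c, 0x00003494)
WAIT_RE = re.compile(
    r"^\[\s*(?P<t>\d+\.\d+)\] \(EE\) NVIDIA\((?P<gpu>[^)]+)\): "
    r"WAIT \((?P<args>[^)]*)\)"
)


def find_wait_errors(xorg_log_text):
    """Return (seconds_since_server_start, gpu_name, raw_args) per WAIT line."""
    hits = []
    for line in xorg_log_text.splitlines():
        m = WAIT_RE.match(line.strip())
        if m:
            # The arguments are left uninterpreted on purpose.
            args = [a.strip() for a in m["args"].split(",")]
            hits.append((float(m["t"]), m["gpu"], args))
    return hits
```

Comparing the gaps between consecutive timestamps with the moments the UI stalls would show whether each freeze corresponds to one of these messages.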

paulmelis · January 14, 2019, 2:24pm

Hmm, this is getting weird. I went back to nvidia-390.xx, specifically the 390.87 driver, but under the 4.20.1 kernel. I now get the same freeze. So perhaps this has more to do with the kernel version than the actual NVIDIA driver.
