Hello, this is about a long-standing issue I have with the /proc interface, which I have experienced since at least driver 460.32.03.
When suspending the system with nvidia-suspend.service enabled, the system fails to resume properly in about 40% of all cases. Instead, it wakes up leaving my monitors (2x DisplayPort) without a video signal. Reconnecting the monitors does not help in this situation, and trying to switch TTYs with Ctrl+Alt+F2 similarly does nothing.
But unlike some similar issues, the rest of the system seems to resume normally. The USB controllers are powered, the keyboard reacts when e.g. pressing Caps Lock, and the NetworkManager and sshd services start as normal.
Investigating the issue over ssh shows that the nvidia-sleep.sh process is unresponsive at 100% CPU and cannot be killed by SIGTERM, SIGKILL or even SIGSEGV. Running nvidia-bug-report.sh in this state hangs even with the recommended flags, producing the attached file. Trying to call systemctl poweroff over ssh kills the ssh connection, but does not actually shut down the computer. I need to either use SysRq+REISUB or perform a hard shutdown to regain control.
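For anyone trying to reproduce this, the scheduler state of the stuck process can be read over ssh from /proc/<pid>/stat. The following is a minimal sketch, assuming the standard Linux layout of that file (pid, then the command name in parentheses, then a one-letter state). The helper name and the sample line are made up for illustration and were not captured from the affected machine.

```python
from typing import Tuple


def parse_proc_stat_state(stat_line: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    # The first field after the closing parenthesis is the state letter.
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


# Illustrative sample line (hypothetical, truncated after a few fields).
sample = "23018 (nvidia-sleep.sh) R 1 23018 23018 0 -1 4194304"
print(parse_proc_stat_state(sample))
```

A state of R together with 100% CPU and ignored signals would suggest the task is spinning inside the kernel, since signals are only handled when a task returns to user space. A state of D would instead mean uninterruptible sleep.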
Just to make it clear, this only happens when nvidia-suspend.service is enabled and the system is therefore suspended with nvidia-sleep.sh suspend. Whether or not it is resumed with nvidia-sleep.sh resume does not make a difference, and neither does the setting of NVreg_PreserveVideoMemoryAllocations.
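For context, the core of what nvidia-sleep.sh does is write a keyword to the driver's procfs file, which matches the nv_procfs_write_suspend frame reached through vfs_write in the journal output. Below is a rough Python sketch of that single step, assuming the path /proc/driver/nvidia/suspend and the keywords described in the NVIDIA driver README. The function name is hypothetical, and the real script also handles VT switching, which is omitted here.

```python
import os
import tempfile

# Assumed path of the driver's power management interface.
NVIDIA_SUSPEND_PATH = "/proc/driver/nvidia/suspend"
VALID_ACTIONS = ("suspend", "hibernate", "resume")


def write_power_state(action: str, path: str = NVIDIA_SUSPEND_PATH) -> None:
    """Write a power management keyword to the given file (needs root for procfs)."""
    if action not in VALID_ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    # On the real procfs file this write blocks until the driver has
    # finished the transition; in the failing case it never returns.
    with open(path, "w") as f:
        f.write(action + "\n")


# Demonstrate against a scratch file so nothing touches the real driver.
with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, "suspend")
    write_power_state("suspend", scratch)
    with open(scratch) as f:
        print(repr(f.read()))
```

Because the write happens inside a system call, a driver that never completes the transition leaves the calling process stuck in the kernel, which fits the unkillable nvidia-sleep.sh process described in this post.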
Here are some things I have tried that did not fix the issue for me:
- Setting NVreg_EnableMSI=0.
- Setting acpi_osi=Windows 2015.
- Using legacy persistence mode.
- Using nvidia-persistenced.service based persistence mode.
- Having nvidia-sleep.sh be called later during the resume process.
- Having chvt be called later during the resume process.
- Removing the “NVIDIA Corporation GM206 High Definition Audio Controller” [10de:0fba] PCI device using udev.
- Enabling/Disabling NVreg_PreserveVideoMemoryAllocations.
- Updating my BIOS.
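When toggling module options like the ones listed here, it is worth confirming that the driver actually picked them up. As far as I know, the loaded driver lists its effective options in /proc/driver/nvidia/params as "Name: value" lines without the NVreg_ prefix; treat that layout as an assumption. The parser below is a small sketch, and the sample text is illustrative rather than copied from the affected system.

```python
from typing import Dict


def parse_nvidia_params(text: str) -> Dict[str, str]:
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        # String-valued options appear quoted; drop the quotes.
        params[key.strip()] = value.strip().strip('"')
    return params


# Illustrative sample (hypothetical values).
sample_params = """EnableMSI: 0
PreserveVideoMemoryAllocations: 1
"""
print(parse_nvidia_params(sample_params))
```

On a live system the input would come from reading /proc/driver/nvidia/params after the module has been loaded with the new options.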
These messages are written to the journal during the failing resume process:
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 23018 at /build/linux514-nvidia/src/NVIDIA-Linux-x86_64-470.63.01-no-compat32/kernel/nvidia/nv.c:3967 nv_restore_user_channels+0xc9/0xe0 [nvidia]
kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio bnep uvcvideo btusb btrtl btbcm videobuf2_vmalloc btintel videobuf2_memops videob>
kernel: sysimgblt fb_sys_fops nvidia(POE) soundcore mei intel_pch_thermal wmi video mac_hid acpi_pad drm ledtrig_timer sg crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod>
kernel: CPU: 2 PID: 23018 Comm: nvidia-sleep.sh Tainted: P OE 5.14.2-1-MANJARO #1
kernel: Hardware name: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12), BIOS 1.90 01/25/2018
kernel: RIP: 0010:nv_restore_user_channels+0xc9/0xe0 [nvidia]
kernel: Code: 89 9c d6 be 01 00 00 00 4c 89 e7 e8 41 a1 00 00 4c 89 ff e8 19 89 9c d6 ba 02 00 00 00 4c 89 e6 48 89 ef e8 59 79 9c 00 eb 94 <0f> 0b eb c6 41 bd 51 00 00 00 eb 9f 66 66 2e 0f 1f 84 00 00 00 00
kernel: RSP: 0018:ffffacbc469abe20 EFLAGS: 00010206
kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: ffffacbc469abdb8
kernel: RDX: 0000000000000087 RSI: 0000000000000246 RDI: ffff995f83321028
kernel: RBP: ffff9963d03db000 R08: 0000000000000000 R09: ffff9964e6dacf30
kernel: R10: 0000000000000000 R11: 0000000000000003 R12: ffff995f88474000
kernel: R13: 0000000000000003 R14: ffff995f88474520 R15: ffff995f88474000
kernel: FS: 00007f8c5a014b80(0000) GS:ffff9964e6d00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f34b95f9000 CR3: 00000005f73a0003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: nv_set_system_power_state+0x222/0x3c0 [nvidia]
kernel: nv_procfs_write_suspend+0x100/0x150 [nvidia]
kernel: proc_reg_write+0x55/0xa0
kernel: vfs_write+0xbc/0x270
kernel: ksys_write+0x67/0xe0
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f8c5a175907
kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007ffead963408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f8c5a175907
kernel: RDX: 0000000000000007 RSI: 000055b5dae39160 RDI: 0000000000000001
kernel: RBP: 000055b5dae39160 R08: 000000000000000a R09: 00007f8c5a246a60
kernel: R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000007
kernel: R13: 00007f8c5a247520 R14: 0000000000000007 R15: 00007f8c5a247700
kernel: ---[ end trace f449d36c8afbba7c ]---
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 23018 at /build/linux514-nvidia/src/NVIDIA-Linux-x86_64-470.63.01-no-compat32/kernel/nvidia/nv.c:4162 nv_set_system_power_state+0x2c0/0x3c0 [nvidia]
kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio bnep uvcvideo btusb btrtl btbcm videobuf2_vmalloc btintel videobuf2_memops videob>
kernel: sysimgblt fb_sys_fops nvidia(POE) soundcore mei intel_pch_thermal wmi video mac_hid acpi_pad drm ledtrig_timer sg crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod>
kernel: CPU: 2 PID: 23018 Comm: nvidia-sleep.sh Tainted: P W OE 5.14.2-1-MANJARO #1
kernel: Hardware name: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12), BIOS 1.90 01/25/2018
kernel: RIP: 0010:nv_set_system_power_state+0x2c0/0x3c0 [nvidia]
kernel: Code: ed 0f 84 4c ff ff ff 41 83 fc 02 74 ea 48 8b 85 88 02 00 00 be 02 00 00 00 48 8b 78 78 e8 b8 d1 ff ff 85 c0 74 d1 0f 0b eb cd <0f> 0b e9 63 ff ff ff 48 c7 c7 d0 fa 4e c2 e8 5d 58 9c d6 e8 78 1c
kernel: RSP: 0018:ffffacbc469abe50 EFLAGS: 00010206
kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: ffff995f83321560
kernel: RDX: 0000000003e37e02 RSI: ffffffffc052e954 RDI: 000033575900a6d0
kernel: RBP: ffff995f88474000 R08: 0000000000000000 R09: ffff9964e6dacf30
kernel: R10: ffff9963d03db000 R11: 0000000000000003 R12: 0000000000000000
kernel: R13: 000055b5dae39160 R14: ffffacbc469abf08 R15: 0000000000000007
kernel: FS: 00007f8c5a014b80(0000) GS:ffff9964e6d00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f34b95f9000 CR3: 00000005f73a0003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: nv_procfs_write_suspend+0x100/0x150 [nvidia]
kernel: proc_reg_write+0x55/0xa0
kernel: vfs_write+0xbc/0x270
kernel: ksys_write+0x67/0xe0
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f8c5a175907
kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007ffead963408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f8c5a175907
kernel: RDX: 0000000000000007 RSI: 000055b5dae39160 RDI: 0000000000000001
kernel: RBP: 000055b5dae39160 R08: 000000000000000a R09: 00007f8c5a246a60
kernel: R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000007
kernel: R13: 00007f8c5a247520 R14: 0000000000000007 R15: 00007f8c5a247700
kernel: ---[ end trace f449d36c8afbba7d ]---
kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:428
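The two traces above differ mainly in where the warning fired (nv.c:3967 in nv_restore_user_channels, then nv.c:4162 in nv_set_system_power_state). To compare several failed resumes without reading every register dump, the warning headers can be pulled out of a saved journal with a short script. This is only a sketch; the regular expression is my own and is written against the line format shown in this post.

```python
import re
from typing import Dict, List

# Matches lines like:
#   WARNING: CPU: 2 PID: 23018 at <file>:<line> <func>+0x../0x.. [module]
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def find_kernel_warnings(journal_text: str) -> List[Dict[str, str]]:
    """Return one dict per kernel WARNING header found in the text."""
    found = []
    for line in journal_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            found.append(match.groupdict())
    return found


journal_excerpt = (
    "kernel: WARNING: CPU: 2 PID: 23018 at /build/linux514-nvidia/src/"
    "NVIDIA-Linux-x86_64-470.63.01-no-compat32/kernel/nvidia/nv.c:3967 "
    "nv_restore_user_channels+0xc9/0xe0 [nvidia]\n"
    "kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification "
    "(0:0x00000000); continuing.\n"
)
for warning in find_kernel_warnings(journal_excerpt):
    print(warning["func"], warning["line"], warning["module"])
```

The nvidia-modeset "WARNING: GPU:0" line is deliberately not matched, since it is a driver log message rather than a kernel WARN trace.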
These messages are written repeatedly after the resume process, indicating that the stuck nvidia-sleep.sh process is blocking the nvidia_modeset driver:
kernel: INFO: task nvidia-modeset/:355 blocked for more than 122 seconds.
kernel: Tainted: P W OE 5.14.2-1-MANJARO #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:nvidia-modeset/ state:D stack: 0 pid: 355 ppid: 2 flags:0x00004000
kernel: Call Trace:
kernel: __schedule+0x316/0x940
kernel: schedule+0x59/0xc0
kernel: rwsem_down_read_slowpath+0x384/0x3e0
kernel: nvkms_kthread_q_callback+0x71/0x100 [nvidia_modeset]
kernel: _main_loop+0x9e/0x150 [nvidia_modeset]
kernel: ? nvkms_sema_up+0x10/0x10 [nvidia_modeset]
kernel: kthread+0x132/0x160
kernel: ? set_kthread_struct+0x40/0x40
kernel: ret_from_fork+0x22/0x30
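Since the hung-task message repeats for as long as the system stays up, it can serve as a simple marker for detecting a failed resume automatically. The sketch below extracts the task name, PID, and blocked time from such lines. The pattern is my own, based on the message format shown above, and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Matches: INFO: task <comm>:<pid> blocked for more than <N> seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(journal_text: str) -> List[Dict[str, object]]:
    """Return one entry per hung-task report found in the text."""
    hung = []
    for line in journal_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hung.append({
                "comm": match["comm"],
                "pid": int(match["pid"]),
                "seconds": int(match["seconds"]),
            })
    return hung


hung_sample = (
    "kernel: INFO: task nvidia-modeset/:355 blocked for more than 122 seconds.\n"
    "kernel: Tainted: P W OE 5.14.2-1-MANJARO #1\n"
)
print(find_hung_tasks(hung_sample))
```

The input could come from something like the output of journalctl -k for the affected boot, saved to a file over ssh before the forced reboot.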
OS: Manjaro 21.1.3 Pahvo
CPU: Intel Core i5-6600K CPU @ 3.50GHz
GPU: GeForce GTX 960
Driver: NVIDIA 470.63.01 (linux514-nvidia-470.63.01-4-x86_64 from the Manjaro repositories)
Mainboard: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12)
Kernel: 5.14.2-1-MANJARO
nvidia-bug-report.log.gz (1.1 KB)