NVIDIA 470.63.01 driver randomly hangs with no video output when resuming from suspend using the /proc interface on GeForce GTX 960

Hello, this is about a long-standing issue with the /proc interface that I have experienced since at least driver 460.32.03.

When suspending the system with nvidia-suspend.service enabled, the system fails to resume properly in about 40% of all cases. Instead, it wakes up but leaves my monitors (2x DisplayPort) without a video signal. Reconnecting the monitors does not help, and trying to switch TTYs with Ctrl+Alt+F2 does nothing either.

But unlike some similar issues, the rest of the system seems to resume normally. The USB controllers are powered, the keyboard reacts when e.g. pressing capslock, and the NetworkManager and sshd services start as normal.

Investigating the issue over ssh shows that the nvidia-sleep.sh process is unresponsive at 100% CPU and cannot be killed by SIGTERM, SIGKILL or even SIGSEGV. Running nvidia-bug-report.sh in this state hangs even with the recommended flags, producing the attached file. Calling systemctl poweroff over ssh kills the ssh connection, but does not actually shut down the computer. I need to either use SysRq+REISUB or perform a hard shutdown to regain control.
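
For anyone trying to confirm the same symptom over ssh, here is a minimal sketch of how the state of the stuck process can be read from /proc. The helper name task_state is made up for this post. State R means the task is running (here presumably spinning inside the kernel) and D means uninterruptible sleep; in both cases a signal is only acted on once the task gets back to user space, which would explain why SIGKILL appears to do nothing.

```shell
# task_state is a hypothetical helper, not part of any NVIDIA tooling.
# $1: PID of the process to inspect.
task_state() {
    pid="$1"
    if [ ! -r "/proc/$pid/status" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # /proc/<pid>/status contains "Name:" and "State:" lines, e.g.
    # "State:  R (running)" or "State:  D (disk sleep)".
    awk '/^(Name|State):/' "/proc/$pid/status"
}
```

It can be run for every matching process with: for pid in $(pgrep -f nvidia-sleep.sh); do task_state "$pid"; done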

Just to make it clear, this only happens when nvidia-suspend.service is enabled and the system is therefore suspended with nvidia-sleep.sh suspend. Whether or not it is resumed with nvidia-sleep.sh resume does not make a difference, and neither does the activation status of NVreg_PreserveVideoMemoryAllocations.
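
For context, this is my understanding of what the /proc interface amounts to. As far as I can tell from NVIDIA's power management documentation, nvidia-sleep.sh essentially writes a keyword to /proc/driver/nvidia/suspend (the real script also switches VTs with chvt, which is omitted here). The sketch below is not the shipped script; the function name nv_power and the NV_SUSPEND_FILE override are hypothetical and only there so the logic can be tried against a scratch file.

```shell
# Sketch only: nv_power and NV_SUSPEND_FILE are made-up names.
# $1: one of suspend, hibernate, resume.
nv_power() {
    proc_file="${NV_SUSPEND_FILE:-/proc/driver/nvidia/suspend}"
    case "$1" in
        suspend|hibernate|resume) ;;
        *)
            echo "usage: nv_power suspend|hibernate|resume" >&2
            return 2
            ;;
    esac
    if [ ! -w "$proc_file" ]; then
        echo "$proc_file is not writable" >&2
        return 1
    fi
    # The write only returns once the driver has handled the request,
    # so the calling process waits inside the kernel until then.
    printf '%s\n' "$1" > "$proc_file"
}
```

This matches the call traces in the logs, where the write() from nvidia-sleep.sh ends up in nv_procfs_write_suspend and nv_set_system_power_state.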

Here are some things I have tried that did not fix the issue for me:

  • Setting NVreg_EnableMSI=0.
  • Setting acpi_osi=Windows 2015.
  • Using legacy persistence mode.
  • Using nvidia-persistenced.service based persistence mode.
  • Having nvidia-sleep.sh be called later during the resume process.
  • Having chvt be called later during the resume process.
  • Removing the “NVIDIA Corporation GM206 High Definition Audio Controller” [10de:0fba] PCI device using udev.
  • Enabling/Disabling NVreg_PreserveVideoMemoryAllocations.
  • Updating my BIOS.
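
To double-check that the module options from the list above were actually applied, the effective values can be read back from the driver. A small sketch, assuming the driver lists them in /proc/driver/nvidia/params as "Name: value" lines without the NVreg_ prefix; the helper name nv_param is made up.

```shell
# nv_param is a hypothetical helper.
# $1: parameter name without the NVreg_ prefix, e.g. EnableMSI.
# $2: file to read (optional; defaults to the assumed driver path).
nv_param() {
    params_file="${2:-/proc/driver/nvidia/params}"
    # Print the value of the matching line; exit non-zero if absent.
    awk -F': *' -v key="$1" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$params_file"
}
```

For example, nv_param EnableMSI and nv_param PreserveVideoMemoryAllocations should print the values the loaded module is really using, independent of what is written in the modprobe configuration.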

These messages are written to the journal during the failing resume process:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 23018 at /build/linux514-nvidia/src/NVIDIA-Linux-x86_64-470.63.01-no-compat32/kernel/nvidia/nv.c:3967 nv_restore_user_channels+0xc9/0xe0 [nvidia]
kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio bnep uvcvideo btusb btrtl btbcm videobuf2_vmalloc btintel videobuf2_memops videob>
kernel:  sysimgblt fb_sys_fops nvidia(POE) soundcore mei intel_pch_thermal wmi video mac_hid acpi_pad drm ledtrig_timer sg crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod>
kernel: CPU: 2 PID: 23018 Comm: nvidia-sleep.sh Tainted: P           OE     5.14.2-1-MANJARO #1
kernel: Hardware name: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12), BIOS 1.90 01/25/2018
kernel: RIP: 0010:nv_restore_user_channels+0xc9/0xe0 [nvidia]
kernel: Code: 89 9c d6 be 01 00 00 00 4c 89 e7 e8 41 a1 00 00 4c 89 ff e8 19 89 9c d6 ba 02 00 00 00 4c 89 e6 48 89 ef e8 59 79 9c 00 eb 94 <0f> 0b eb c6 41 bd 51 00 00 00 eb 9f 66 66 2e 0f 1f 84 00 00 00 00
kernel: RSP: 0018:ffffacbc469abe20 EFLAGS: 00010206
kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: ffffacbc469abdb8
kernel: RDX: 0000000000000087 RSI: 0000000000000246 RDI: ffff995f83321028
kernel: RBP: ffff9963d03db000 R08: 0000000000000000 R09: ffff9964e6dacf30
kernel: R10: 0000000000000000 R11: 0000000000000003 R12: ffff995f88474000
kernel: R13: 0000000000000003 R14: ffff995f88474520 R15: ffff995f88474000
kernel: FS:  00007f8c5a014b80(0000) GS:ffff9964e6d00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f34b95f9000 CR3: 00000005f73a0003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  nv_set_system_power_state+0x222/0x3c0 [nvidia]
kernel:  nv_procfs_write_suspend+0x100/0x150 [nvidia]
kernel:  proc_reg_write+0x55/0xa0
kernel:  vfs_write+0xbc/0x270
kernel:  ksys_write+0x67/0xe0
kernel:  do_syscall_64+0x3b/0x90
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f8c5a175907
kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007ffead963408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f8c5a175907
kernel: RDX: 0000000000000007 RSI: 000055b5dae39160 RDI: 0000000000000001
kernel: RBP: 000055b5dae39160 R08: 000000000000000a R09: 00007f8c5a246a60
kernel: R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000007
kernel: R13: 00007f8c5a247520 R14: 0000000000000007 R15: 00007f8c5a247700
kernel: ---[ end trace f449d36c8afbba7c ]---
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 23018 at /build/linux514-nvidia/src/NVIDIA-Linux-x86_64-470.63.01-no-compat32/kernel/nvidia/nv.c:4162 nv_set_system_power_state+0x2c0/0x3c0 [nvidia]
kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio bnep uvcvideo btusb btrtl btbcm videobuf2_vmalloc btintel videobuf2_memops videob>
kernel:  sysimgblt fb_sys_fops nvidia(POE) soundcore mei intel_pch_thermal wmi video mac_hid acpi_pad drm ledtrig_timer sg crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod>
kernel: CPU: 2 PID: 23018 Comm: nvidia-sleep.sh Tainted: P        W  OE     5.14.2-1-MANJARO #1
kernel: Hardware name: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12), BIOS 1.90 01/25/2018
kernel: RIP: 0010:nv_set_system_power_state+0x2c0/0x3c0 [nvidia]
kernel: Code: ed 0f 84 4c ff ff ff 41 83 fc 02 74 ea 48 8b 85 88 02 00 00 be 02 00 00 00 48 8b 78 78 e8 b8 d1 ff ff 85 c0 74 d1 0f 0b eb cd <0f> 0b e9 63 ff ff ff 48 c7 c7 d0 fa 4e c2 e8 5d 58 9c d6 e8 78 1c
kernel: RSP: 0018:ffffacbc469abe50 EFLAGS: 00010206
kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: ffff995f83321560
kernel: RDX: 0000000003e37e02 RSI: ffffffffc052e954 RDI: 000033575900a6d0
kernel: RBP: ffff995f88474000 R08: 0000000000000000 R09: ffff9964e6dacf30
kernel: R10: ffff9963d03db000 R11: 0000000000000003 R12: 0000000000000000
kernel: R13: 000055b5dae39160 R14: ffffacbc469abf08 R15: 0000000000000007
kernel: FS:  00007f8c5a014b80(0000) GS:ffff9964e6d00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f34b95f9000 CR3: 00000005f73a0003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  nv_procfs_write_suspend+0x100/0x150 [nvidia]
kernel:  proc_reg_write+0x55/0xa0
kernel:  vfs_write+0xbc/0x270
kernel:  ksys_write+0x67/0xe0
kernel:  do_syscall_64+0x3b/0x90
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f8c5a175907
kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007ffead963408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f8c5a175907
kernel: RDX: 0000000000000007 RSI: 000055b5dae39160 RDI: 0000000000000001
kernel: RBP: 000055b5dae39160 R08: 000000000000000a R09: 00007f8c5a246a60
kernel: R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000007
kernel: R13: 00007f8c5a247520 R14: 0000000000000007 R15: 00007f8c5a247700
kernel: ---[ end trace f449d36c8afbba7d ]---
kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:428
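
Since the full traces are long, here is a rough sketch for condensing them to the source location and function of each WARNING, which makes it easier to compare reports from different machines. The function name nv_warnings is made up; it only assumes the "WARNING: CPU: ... at <path>:<line> <function>+<offset>" layout seen in the log above.

```shell
# nv_warnings is a hypothetical helper. It reads kernel log text on
# stdin and prints "<file>:<line> <function>" for every WARNING line.
nv_warnings() {
    awk '/WARNING: CPU:/ {
        for (i = 1; i < NF; i++) {
            if ($i == "at") {
                n = split($(i + 1), path, "/")   # keep only the file name
                split($(i + 2), sym, "[+]")      # drop the +offset/size part
                print path[n], sym[1]
            }
        }
    }'
}
```

Piping the previous boot's kernel messages through it (journalctl -k -b -1 | nv_warnings) reduces the log above to two lines: nv.c:3967 nv_restore_user_channels and nv.c:4162 nv_set_system_power_state.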

These messages are written repeatedly after the resume, indicating that the stuck nvidia-sleep.sh process is blocking the nvidia_modeset driver:

kernel: INFO: task nvidia-modeset/:355 blocked for more than 122 seconds.
kernel:       Tainted: P        W  OE     5.14.2-1-MANJARO #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:nvidia-modeset/ state:D stack:    0 pid:  355 ppid:     2 flags:0x00004000
kernel: Call Trace:
kernel:  __schedule+0x316/0x940
kernel:  schedule+0x59/0xc0
kernel:  rwsem_down_read_slowpath+0x384/0x3e0
kernel:  nvkms_kthread_q_callback+0x71/0x100 [nvidia_modeset]
kernel:  _main_loop+0x9e/0x150 [nvidia_modeset]
kernel:  ? nvkms_sema_up+0x10/0x10 [nvidia_modeset]
kernel:  kthread+0x132/0x160
kernel:  ? set_kthread_struct+0x40/0x40
kernel:  ret_from_fork+0x22/0x30

OS: Manjaro 21.1.3 Pahvo
CPU: Intel Core i5-6600K CPU @ 3.50GHz
GPU: GeForce GTX 960
Driver: NVIDIA 470.63.01 (linux514-nvidia-470.63.01-4-x86_64 from the Manjaro repositories)
Mainboard: MSI MS-7A12/Z170A GAMING PRO CARBON (MS-7A12)
Kernel: 5.14.2-1-MANJARO

nvidia-bug-report.log.gz (1.1 KB)

Hi insert-penguin, any luck resolving this? I have a similar issue, also with a GTX 960. The warning I get is:

kernel: [45886.201132] ------------[ cut here ]------------
kernel: [45886.201134] WARNING: CPU: 7 PID: 76616 at /var/lib/dkms/nvidia/470.63.01/build/nvidia/nv.c:3967 nv_restore_user_channels+0xce/0xe0 [nvidia]
kernel: [45886.201284] Modules linked in: rfcomm nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables libcrc32c nfnetlink cmac bridge algif_hash stp llc algif_skcipher overlay af_alg bnep nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) intel_rapl_msr intel_rapl_common nvidia(POE) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel iwlmvm snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence nls_iso8859_1 snd_hda_codec edac_mce_amd mac80211 snd_hda_core snd_hwdep soundwire_bus kvm_amd snd_soc_core libarc4 kvm snd_compress ac97_bus snd_pcm_dmaengine btusb snd_seq_midi btrtl snd_seq_midi_event btbcm crct10dif_pclmul btintel ghash_clmulni_intel snd_pcm snd_rawmidi bluetooth aesni_intel snd_seq crypto_simd drm_kms_helper cryptd glue_helper
kernel: [45886.201312]  snd_seq_device ecdh_generic rapl joydev input_leds eeepc_wmi wmi_bmof mxm_wmi efi_pstore iwlwifi ccp ecc cec snd_timer k10temp snd rc_core cfg80211 fb_sys_fops syscopyarea soundcore sysfillrect sysimgblt mac_hid sch_fq_codel msr parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid mfd_aaeon asus_wmi sparse_keymap video igb nvme ahci i2c_algo_bit xhci_pci crc32_pclmul i2c_piix4 libahci nvme_core xhci_pci_renesas dca wmi
kernel: [45886.201333] CPU: 7 PID: 76616 Comm: nvidia-sleep.sh Tainted: P        W  OE     5.11.0-37-generic #41-Ubuntu
kernel: [45886.201335] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3604 05/08/2021
kernel: [45886.201336] RIP: 0010:nv_restore_user_channels+0xce/0xe0 [nvidia]
kernel: [45886.201446] Code: 05 a3 cf be 01 00 00 00 4c 89 ef e8 7c a5 00 00 48 89 df e8 64 04 a3 cf ba 02 00 00 00 4c 89 ee 4c 89 e7 e8 34 83 9c 00 eb 93 <0f> 0b eb c6 41 be 51 00 00 00 eb 9e 66 0f 1f 44 00 00 0f 1f 44 00
kernel: [45886.201447] RSP: 0018:ffffab994252bde8 EFLAGS: 00010206
kernel: [45886.201448] RAX: 0000000000000003 RBX: ffff9dde46c04000 RCX: ffffab994252bd80
kernel: [45886.201448] RDX: 0000000000000087 RSI: 0000000000000246 RDI: 0000000000000246
kernel: [45886.201449] RBP: ffffab994252be10 R08: 0000000000000000 R09: ffff9de54ea2c3f0
kernel: [45886.201449] R10: 0000000000000000 R11: 00000000000001ae R12: ffff9de2a05a8000
kernel: [45886.201451] R13: ffff9dde46c04000 R14: 0000000000000003 R15: ffff9dde46c04520
kernel: [45886.201451] FS:  00007f0b00989740(0000) GS:ffff9de54ebc0000(0000) knlGS:0000000000000000
kernel: [45886.201452] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [45886.201453] CR2: 00001bb800485068 CR3: 000000011f396000 CR4: 0000000000750ee0
kernel: [45886.201454] PKRU: 55555554
kernel: [45886.201454] Call Trace:
kernel: [45886.201456]  nv_set_system_power_state+0x228/0x3d0 [nvidia]
kernel: [45886.201566]  nv_procfs_write_suspend+0xea/0x140 [nvidia]
kernel: [45886.201676]  proc_reg_write+0x5a/0x90
kernel: [45886.201680]  ? _cond_resched+0x1a/0x50
kernel: [45886.201682]  vfs_write+0xc6/0x270
kernel: [45886.201684]  ksys_write+0x67/0xe0
kernel: [45886.201686]  __x64_sys_write+0x1a/0x20
kernel: [45886.201687]  do_syscall_64+0x38/0x90
kernel: [45886.201688]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: [45886.201690] RIP: 0033:0x7f0b00a93c27
kernel: [45886.201691] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: [45886.201691] RSP: 002b:00007fff21085848 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: [45886.201693] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f0b00a93c27
kernel: [45886.201693] RDX: 0000000000000007 RSI: 000055b314034fe0 RDI: 0000000000000001
kernel: [45886.201694] RBP: 000055b314034fe0 R08: 000000000000000a R09: 000055b314034fe0
kernel: [45886.201694] R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000007
kernel: [45886.201695] R13: 00007f0b00b6d6c0 R14: 00007f0b00b6e4a0 R15: 00007f0b00b6d8a0
kernel: [45886.201696] ---[ end trace 58ffcae3d517b433 ]---
kernel: [45886.201702] ------------[ cut here ]------------

This also happens on Ubuntu 20.04.3 with a Kepler GPU.
A few times the machine does resume without problems.