Kernel 5.6: system freeze when resuming from suspend or hibernate

As I posted before they are already enabled.

What you posted shows that they are not enabled. (Active: inactive (dead).) “Loaded” does not mean the service is enabled. It just means systemd has seen the service file and loaded it, not that it actually started it.

Enable these services and reboot. The status command will then look very different.

I did what was suggested (again), and the output of the status command is exactly the same. Look at the word ‘enabled’ after the service file name. This indicates that the service is indeed enabled. As for why it’s not active? Well, looking at the service file:

[Unit]
Description=NVIDIA system suspend actions
Before=systemd-suspend.service

[Service]
Type=oneshot
ExecStart=/usr/bin/logger -t suspend -s "nvidia-suspend.service"
ExecStart=/usr/bin/nvidia-sleep.sh "suspend"

[Install]
RequiredBy=systemd-suspend.service

It seems that it runs when the system enters a specific state (suspend in this example), and it is a oneshot service, so why would it be active all the time?
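For anyone else tripped up by the status output: "enabled" and "active" are separate properties of a unit. Below is a minimal sketch that makes the distinction explicit. The function name summarize_unit_state is hypothetical; it only reads the Key=Value lines that systemctl show prints for the UnitFileState and ActiveState properties, and it assumes a oneshot unit without RemainAfterExit, like the one quoted above.

```shell
# Hypothetical helper: reads "Key=Value" lines as printed by
#   systemctl show -p UnitFileState -p ActiveState nvidia-suspend.service
# and explains what the combination means.
summarize_unit_state() {
    unit_file_state=""
    active_state=""
    while IFS='=' read -r key value; do
        case "$key" in
            UnitFileState) unit_file_state=$value ;;  # enabled / disabled / ...
            ActiveState)   active_state=$value ;;     # active / inactive / ...
        esac
    done
    if [ "$unit_file_state" = "enabled" ] && [ "$active_state" = "inactive" ]; then
        # A oneshot unit goes back to inactive once its ExecStart lines finish.
        echo "enabled, not currently running (normal for a oneshot unit between suspends)"
    else
        echo "UnitFileState=$unit_file_state ActiveState=$active_state"
    fi
}
```

Usage: systemctl show -p UnitFileState -p ActiveState nvidia-suspend.service | summarize_unit_state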

Yeah, you’re right – these services are launched before and after systemd actually suspends the system in order to give the kernel driver an opportunity to save vidmem contents before the GPU is powered off. Your very first post shows that this mechanism was working because the kernel printed Comm: nvidia-sleep.sh, which is the script that is launched by the systemd units.

Can you please run nvidia-bug-report.sh after reproducing the soft lockup problem with the latest driver and attach the nvidia-bug-report.log.gz file here?

I reproduced the problem and quickly switched to another tty to run the script. The script hung during execution, and a ‘cat’ subprocess was using 100% CPU. I suppose it tried to read something from the nvidia device file. I tried running the script with the --safe-mode and --extra-system-data parameters, but it still hangs. I’m attaching the resulting file, though I’m not sure it has enough data. I’m not an expert here, but it looks like an infinite loop of sorts inside the kernel module (as nvidia-sleep.sh also consumes 100% CPU).
nvidia-bug-report.log.gz (1.1 KB)

I reproduced the problem with kernel 5.11.13 with acpi_osi=! acpi_osi="Windows 2009" set. Attaching dmesg output that contains a stack trace, plus an nvidia-bug-report log (generated under normal conditions, i.e. before the freeze, as I cannot get a full log once the freeze occurs).
dmesg.txt (119.6 KB)
nvidia-bug-report.log.gz (1.1 MB)

465.24.02 - same thing happening

I tried an older kernel (5.3.7), but the problem persists there too (465.27 driver). It looks like I was wrong initially and it is not related to the 5.5 to 5.6 transition.

Looking at the logs, it seems you tried to enable dynamic power management on a platform that doesn’t support it. That shouldn’t matter, but in your case the nvidia audio device can’t be turned on anymore (did you use the udev rules to set runpm to auto?). I don’t know whether this has an influence, but you should remove that.

I didn’t place any udev rules manually, but maybe some package did. Can you give me a hint about what I should look for? Specific rule contents? Also, I’ll remove NVreg_DynamicPowerManagement from my kernel cmdline, thanks.
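A rough sketch of how one might look for both things mentioned here. The function names are hypothetical. The first prints any kernel command-line token that sets an NVIDIA module option. The second lists rule files that touch the power/control attribute, on the assumption that a rule setting runtime PM to auto does so by writing to that attribute.

```shell
# Hypothetical helper: prints every kernel command-line token that sets an
# NVIDIA module option (anything containing "NVreg_").
find_nvreg_params() {
    tr ' ' '\n' | grep 'NVreg_'
}

# Hypothetical helper: lists files under the given directories that mention
# the runtime PM attribute "power/control".
find_runpm_rules() {
    grep -rl 'power/control' "$@" 2>/dev/null
}
```

Usage: find_nvreg_params < /proc/cmdline, then find_runpm_rules /etc/udev/rules.d /usr/lib/udev/rules.d. No output from either means nothing matched.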

Please see this:
https://forums.developer.nvidia.com/t/no-option-for-audio-over-displayport-hdmi/175889/2?u=generix

I followed the guide from that post, but it seems the nvidia audio device cannot be turned on no matter what.

snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
snd_hda_intel 0000:01:00.1: can't change power state from D3hot to D0 (config space inaccessible)

01:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

I dual boot with Windows 10; I checked in Windows and that audio device is not present there either. But then again, the HDMI output I have comes from the integrated graphics, not nvidia, so the 740M card doesn’t have any outputs. My guess is that the audio device is disabled by the manufacturer’s (ASUS) design. Windows 10 resumes from standby just fine, so maybe the nvidia driver tries to power up the audio device unconditionally.
I’ll try to remove those udev rules completely.

UPDATE: Nope, that didn’t help
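A note on reading the lspci output above: "rev ff" together with "Unknown header type 7f" means every byte read from the function's config space came back as 0xff, which is what the bus returns when nothing responds. That fits the "config space inaccessible" kernel messages. A small sketch for spotting such functions follows; the function name is hypothetical and it simply filters plain lspci output.

```shell
# Hypothetical helper: reads `lspci` output on stdin and prints the address of
# every function whose revision reads back as ff (config space not readable).
find_unreadable_pci_functions() {
    grep -F '(rev ff)' | cut -d' ' -f1
}
```

Usage: lspci | find_unreadable_pci_functions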

Thank you, @generix, for pointing me in the right direction! I added the simple udev rule to remove problematic device completely:

cat /etc/udev/rules.d/10-remove-nvidia-audio.rules
ACTION=="add", KERNEL=="0000:01:00.1", SUBSYSTEM=="pci", RUN+="/bin/sh -c 'echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove'"

Now lspci doesn’t list the device. And the problem is gone! Laptop wakes up perfectly!
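Before making this permanent with a udev rule, the same sysfs write can be tried once by hand to see whether it helps on a given machine. A minimal sketch follows; the function name is hypothetical, and the optional second argument exists only so the function can be exercised against a fake sysfs tree. It needs root on a real system, and the device stays gone only until a PCI rescan or reboot.

```shell
# Hypothetical helper: detach one PCI function until the next rescan/reboot.
# $1 = full PCI address (e.g. 0000:01:00.1)
# $2 = sysfs root, defaults to /sys
remove_pci_function() {
    addr=$1
    sysfs=${2:-/sys}
    node="$sysfs/bus/pci/devices/$addr/remove"
    if [ ! -e "$node" ]; then
        echo "no such PCI function: $addr" >&2
        return 1
    fi
    # Writing 1 to the "remove" attribute asks the kernel to drop the device.
    echo 1 > "$node"
}
```

Usage (as root): remove_pci_function 0000:01:00.1, then suspend and resume. To bring the device back without rebooting: echo 1 > /sys/bus/pci/rescan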

Nice sleuthing tracking that down.

I wonder if this problem is related to a workaround in the Linux kernel that tries to enable audio on NVIDIA devices in laptops that normally disable them at boot and expect the Windows driver to enable it dynamically. There was a thread about this recently: [Nouveau] [PATCH v2] ALSA: hda: Continue to probe when codec probe fails

It’s possible that this Linux kernel quirk is enabling the audio function on your GPU when it really doesn’t have one, causing this problem.

The quirk was introduced in kernel 5.4; the discussions back then resembled the new one, a real déjà vu.
I guess that while the dead audio device is the trigger, the cause is the new power management: why does it need to access the audio device, and why does it hang if the device is inaccessible? The other thread linked here showed this can happen in other ways, too.

Also experiencing a very similar issue to this.

Jul 08 13:18:58 kernel: ------------[ cut here ]------------
Jul 08 13:18:58 kernel: WARNING: CPU: 3 PID: 3133 at /var/lib/dkms/nvidia/465.31/build/nvidia/nv.c:3909 nv_restore_user_channels+0xce/0xe0 [nvidia]
Jul 08 13:18:58 kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lib>
Jul 08 13:18:58 kernel:  crypto_simd cryptd snd_pcm glue_helper drm_kms_helper snd_seq_device rapl intel_cstate efi_pstore wmi_bmof intel_wmi_thunderbolt input_leds joydev snd_timer mxm_wmi cec snd rc_core ee1004 fb_sys_fops so>
Jul 08 13:18:58 kernel: CPU: 3 PID: 3133 Comm: nvidia-sleep.sh Tainted: P           OE     5.8.0-59-generic #66~20.04.1-Ubuntu
Jul 08 13:18:58 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C80/MAG Z490 TOMAHAWK (MS-7C80), BIOS 1.80 04/19/2021
Jul 08 13:18:58 kernel: RIP: 0010:nv_restore_user_channels+0xce/0xe0 [nvidia]
Jul 08 13:18:58 kernel: Code: 08 c1 da be 01 00 00 00 4c 89 ef e8 9c 9c 00 00 48 89 df e8 24 09 c1 da ba 02 00 00 00 4c 89 ee 4c 89 e7 e8 44 eb 99 00 eb 93 <0f> 0b eb c6 41 be 51 00 00 00 eb 9e 66 0f 1f 44 00 00 0f 1f 44 00
Jul 08 13:18:58 kernel: RSP: 0018:ffff9aff054bfde8 EFLAGS: 00010206
Jul 08 13:18:58 kernel: RAX: 0000000000000003 RBX: ffff8dbef74e2000 RCX: ffff9aff054bfd88
Jul 08 13:18:58 kernel: RDX: 0000000000000087 RSI: 0000000000000246 RDI: 0000000000000246
Jul 08 13:18:58 kernel: RBP: ffff9aff054bfe10 R08: 0000000000000000 R09: 00000000000000cb
Jul 08 13:18:58 kernel: R10: ffff8dbf06e93110 R11: ffff8dbf0dd6c870 R12: ffff8dbeb7d60000
Jul 08 13:18:58 kernel: R13: ffff8dbef74e2000 R14: 0000000000000003 R15: ffff8dbef74e2510
Jul 08 13:18:58 kernel: FS:  00007f7e0e266740(0000) GS:ffff8dbf0dac0000(0000) knlGS:0000000000000000
Jul 08 13:18:58 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 08 13:18:58 kernel: CR2: 000055b3f5387098 CR3: 00000004424a6002 CR4: 00000000007606e0
Jul 08 13:18:58 kernel: PKRU: 55555554
Jul 08 13:18:58 kernel: Call Trace:
Jul 08 13:18:58 kernel:  nv_set_system_power_state+0x224/0x3c0 [nvidia]
Jul 08 13:18:58 kernel:  nv_procfs_write_suspend+0xe7/0x140 [nvidia]
Jul 08 13:18:58 kernel:  proc_reg_write+0x66/0x90
Jul 08 13:18:58 kernel:  vfs_write+0xc9/0x200
Jul 08 13:18:58 kernel:  ksys_write+0x67/0xe0
Jul 08 13:18:58 kernel:  __x64_sys_write+0x1a/0x20
Jul 08 13:18:58 kernel:  do_syscall_64+0x49/0xc0
Jul 08 13:18:58 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 08 13:18:58 kernel: RIP: 0033:0x7f7e0e37a1e7
Jul 08 13:18:58 kernel: Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Jul 08 13:18:58 kernel: RSP: 002b:00007fffd37b09c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul 08 13:18:58 kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f7e0e37a1e7
Jul 08 13:18:58 kernel: RDX: 0000000000000007 RSI: 0000561abd2ccb30 RDI: 0000000000000001
Jul 08 13:18:58 kernel: RBP: 0000561abd2ccb30 R08: 000000000000000a R09: 0000000000000006
Jul 08 13:18:58 kernel: R10: 0000561abc597017 R11: 0000000000000246 R12: 0000000000000007
Jul 08 13:18:58 kernel: R13: 00007f7e0e4556a0 R14: 00007f7e0e4564a0 R15: 00007f7e0e4558a0
Jul 08 13:18:58 kernel: ---[ end trace 2171d60572b69fb9 ]---
Jul 08 13:18:58 kernel: ------------[ cut here ]------------
Jul 08 13:18:58 kernel: WARNING: CPU: 3 PID: 3133 at /var/lib/dkms/nvidia/465.31/build/nvidia/nv.c:4104 nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
Jul 08 13:18:58 kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lib>
Jul 08 13:18:58 kernel:  crypto_simd cryptd snd_pcm glue_helper drm_kms_helper snd_seq_device rapl intel_cstate efi_pstore wmi_bmof intel_wmi_thunderbolt input_leds joydev snd_timer mxm_wmi cec snd rc_core ee1004 fb_sys_fops so>
Jul 08 13:18:58 kernel: CPU: 3 PID: 3133 Comm: nvidia-sleep.sh Tainted: P        W  OE     5.8.0-59-generic #66~20.04.1-Ubuntu
Jul 08 13:18:58 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C80/MAG Z490 TOMAHAWK (MS-7C80), BIOS 1.80 04/19/2021
Jul 08 13:18:58 kernel: RIP: 0010:nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
Jul 08 13:18:58 kernel: Code: 00 4d 85 e4 0f 84 4a ff ff ff 41 83 fd 02 74 e9 49 8b bc 24 78 02 00 00 be 02 00 00 00 e8 b7 d1 ff ff 85 c0 74 d3 0f 0b eb cf <0f> 0b e9 64 ff ff ff 48 c7 c7 70 49 26 c3 e8 9c d8 c0 da e8 d7 12
Jul 08 13:18:58 kernel: RSP: 0018:ffff9aff054bfe20 EFLAGS: 00010206
Jul 08 13:18:58 kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: 00000000000000bc
Jul 08 13:18:58 kernel: RDX: 00000000000000bb RSI: 1f704ba77eaa6627 RDI: 00002d3ff2010d40
Jul 08 13:18:58 kernel: RBP: ffff9aff054bfe50 R08: 0000000000000000 R09: 00000000000000cb
Jul 08 13:18:58 kernel: R10: ffff8dbf06e93110 R11: ffff8dbf0dd6c870 R12: ffff8dbef74e2000
Jul 08 13:18:58 kernel: R13: 0000000000000000 R14: ffff9aff054bfef0 R15: 0000561abd2ccb30
Jul 08 13:18:58 kernel: FS:  00007f7e0e266740(0000) GS:ffff8dbf0dac0000(0000) knlGS:0000000000000000
Jul 08 13:18:58 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 08 13:18:58 kernel: CR2: 000055b3f5387098 CR3: 00000004424a6002 CR4: 00000000007606e0
Jul 08 13:18:58 kernel: PKRU: 55555554
Jul 08 13:18:58 kernel: Call Trace:
Jul 08 13:18:58 kernel:  nv_procfs_write_suspend+0xe7/0x140 [nvidia]
Jul 08 13:18:58 kernel:  proc_reg_write+0x66/0x90
Jul 08 13:18:58 kernel:  vfs_write+0xc9/0x200
Jul 08 13:18:58 kernel:  ksys_write+0x67/0xe0
Jul 08 13:18:58 kernel:  __x64_sys_write+0x1a/0x20
Jul 08 13:18:58 kernel:  do_syscall_64+0x49/0xc0
Jul 08 13:18:58 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 08 13:18:58 kernel: RIP: 0033:0x7f7e0e37a1e7
Jul 08 13:18:58 kernel: Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Jul 08 13:18:58 kernel: RSP: 002b:00007fffd37b09c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul 08 13:18:58 kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f7e0e37a1e7
Jul 08 13:18:58 kernel: RDX: 0000000000000007 RSI: 0000561abd2ccb30 RDI: 0000000000000001
Jul 08 13:18:58 kernel: RBP: 0000561abd2ccb30 R08: 000000000000000a R09: 0000000000000006
Jul 08 13:18:58 kernel: R10: 0000561abc597017 R11: 0000000000000246 R12: 0000000000000007
Jul 08 13:18:58 kernel: R13: 00007f7e0e4556a0 R14: 00007f7e0e4564a0 R15: 00007f7e0e4558a0
Jul 08 13:18:58 kernel: ---[ end trace 2171d60572b69fba ]---
Jul 08 13:19:01 kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Jul 08 13:19:02 kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jul 08 13:19:03 kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:428

It looks like the call trace is a little different, and it doesn’t crash every time; it happens seemingly at random. The solution mentioned in the other thread that @generix linked doesn’t work for me, and the solution found by @thesourcehim causes it to freeze on every resume (it froze every time in around 6 runs).

System:
OS: Ubuntu 20.04
Kernel: 5.8.0-59
Nvidia: nvidia 465.27
GPU: 980 ti
CPU: 10700k
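To compare traces like this one across reports, it can help to reduce the log to the source location and symbol of each driver warning. A minimal sketch follows, assuming the journal line format shown above; the function name is hypothetical.

```shell
# Hypothetical helper: reads kernel log lines on stdin and, for each WARNING
# raised inside the nvidia module, prints "<source file:line> <symbol+offset>".
nvidia_warning_sites() {
    grep 'WARNING: CPU' | grep -F '[nvidia]' \
        | sed -E 's/.* at ([^ ]+) ([^ ]+) \[nvidia\]$/\1 \2/'
}
```

Usage: journalctl -k -b -1 | nvidia_warning_sites (kernel messages from the previous boot). For the log above this gives nv.c:3909 in nv_restore_user_channels and nv.c:4104 in nv_set_system_power_state.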

@npissoawsome my solution is specific to my laptop model. Even if you have a similar issue, the faulty device could be located at a different PCI address; use lspci to check.
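Since the address differs between machines, here is a small sketch that prints the same kind of udev rule for whatever address lspci reports. The function name is hypothetical; the rule text mirrors the one posted earlier in this thread. Note that plain lspci omits the domain prefix, while lspci -D prints the full form (for example 0000:01:00.1) that the rule needs.

```shell
# Hypothetical helper: prints a udev rule that removes the given PCI function
# as soon as it is added. $1 = full PCI address, e.g. 0000:01:00.1
make_pci_remove_rule() {
    addr=$1
    printf 'ACTION=="add", KERNEL=="%s", SUBSYSTEM=="pci", RUN+="/bin/sh -c '\''echo 1 > /sys/bus/pci/devices/%s/remove'\''"\n' "$addr" "$addr"
}
```

Usage: make_pci_remove_rule 0000:01:00.1 > /etc/udev/rules.d/10-remove-nvidia-audio.rules (as root), then reboot. Only do this for a function that is confirmed dead; removing a working device this way disables it.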

@thesourcehim I checked lspci, my card was on the same PCI slot as yours actually haha

Enabling persistence mode with nvidia-smi -pm ENABLED fixes this issue for me.
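Persistence mode set this way does not survive a reboot, so it is worth checking the current state before concluding that the fix stopped working. A minimal sketch follows; the function name is hypothetical, and it assumes the "index, state" CSV lines that nvidia-smi prints for the index and persistence_mode query fields.

```shell
# Hypothetical helper: reads lines such as "0, Enabled" / "1, Disabled" from
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# and prints the index of every GPU that has persistence mode off.
gpus_without_persistence() {
    awk -F', *' '$2 == "Disabled" { print $1 }'
}
```

Usage: nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence. Empty output means every GPU already has persistence mode enabled.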