Kernel 5.6: system freeze when resuming from suspend or hibernate

thesourcehim · May 3, 2021, 4:12pm

I followed the guide from that post, but it seems the NVIDIA audio device cannot be powered on no matter what.

snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
snd_hda_intel 0000:01:00.1: can't change power state from D3hot to D0 (config space inaccessible)

01:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

I dual-boot Windows 10. I checked in Windows, and that audio device is not present there either. Then again, my HDMI output is driven by the integrated graphics, not the NVIDIA GPU, so the 740M card has no outputs at all. My guess is that the audio device is disabled by design by the manufacturer (ASUS). Windows 10 resumes from standby just fine; maybe the nvidia driver tries to power the audio device unconditionally.
I’ll try to remove those udev rules completely.

UPDATE: Nope, that didn’t help

thesourcehim · May 4, 2021, 8:34am

Thank you, @generix, for pointing me in the right direction! I added a simple udev rule to remove the problematic device completely:

cat /etc/udev/rules.d/10-remove-nvidia-audio.rules
ACTION=="add", KERNEL=="0000:01:00.1", SUBSYSTEM=="pci", RUN+="/bin/sh -c 'echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove'"

Now lspci no longer lists the device, and the problem is gone: the laptop wakes up perfectly!
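Before making a rule like this permanent, the same removal can be tried once by hand. The sketch below is a hypothetical helper (`remove_pci_function` is not part of any tool) that writes to the sysfs `remove` attribute, exactly as the rule's RUN command does. The `SYSFS_PCI` variable is an assumption added only so the logic can be exercised against a fake directory; on a real system it defaults to `/sys/bus/pci/devices` and the call must run as root.

```shell
#!/bin/sh
# Hypothetical helper: remove one PCI function through sysfs, the same
# operation the udev rule's RUN+= command performs.
# SYSFS_PCI is an assumed override for testing; it defaults to the real path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

remove_pci_function() {
    addr="$1"    # full domain:bus:device.function, e.g. 0000:01:00.1
    dev="$SYSFS_PCI/$addr"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI function: $addr" >&2
        return 1
    fi
    # Writing 1 detaches the driver and removes the device from the PCI tree.
    echo 1 > "$dev/remove"
    echo "removed $addr"
}

# Example (as root): remove_pci_function 0000:01:00.1
```

A removal done this way lasts only until the next boot. Writing `1` to `/sys/bus/pci/rescan` as root re-enumerates the bus, which should bring the function back if you want to undo it without rebooting.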

aplattner · May 5, 2021, 7:42am

Nice sleuthing tracking that down.

I wonder if this problem is related to a workaround in the Linux kernel that tries to enable audio on NVIDIA devices in laptops that normally disable it at boot and expect the Windows driver to enable it dynamically. There was a thread about this recently: [Nouveau] [PATCH v2] ALSA: hda: Continue to probe when codec probe fails

It’s possible that this Linux kernel quirk is enabling the audio function on your GPU when it really doesn’t have one, causing this problem.

generix · May 5, 2021, 9:24am

The quirk was introduced in kernel 5.4, and the discussions back then resembled this one; really a déjà vu.
Still, I suspect that while the dead audio device is the trigger, the actual cause is the new power management: the open question is why it needs to access the audio device at all and hangs if that device is inaccessible. The other thread linked here showed this can happen in other ways, too.

npissoawsome · July 8, 2021, 5:55pm

I'm also experiencing a very similar issue.

Jul 08 13:18:58 kernel: ------------[ cut here ]------------
Jul 08 13:18:58 kernel: WARNING: CPU: 3 PID: 3133 at /var/lib/dkms/nvidia/465.31/build/nvidia/nv.c:3909 nv_restore_user_channels+0xce/0xe0 [nvidia]
Jul 08 13:18:58 kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lib>
Jul 08 13:18:58 kernel:  crypto_simd cryptd snd_pcm glue_helper drm_kms_helper snd_seq_device rapl intel_cstate efi_pstore wmi_bmof intel_wmi_thunderbolt input_leds joydev snd_timer mxm_wmi cec snd rc_core ee1004 fb_sys_fops so>
Jul 08 13:18:58 kernel: CPU: 3 PID: 3133 Comm: nvidia-sleep.sh Tainted: P           OE     5.8.0-59-generic #66~20.04.1-Ubuntu
Jul 08 13:18:58 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C80/MAG Z490 TOMAHAWK (MS-7C80), BIOS 1.80 04/19/2021
Jul 08 13:18:58 kernel: RIP: 0010:nv_restore_user_channels+0xce/0xe0 [nvidia]
Jul 08 13:18:58 kernel: Code: 08 c1 da be 01 00 00 00 4c 89 ef e8 9c 9c 00 00 48 89 df e8 24 09 c1 da ba 02 00 00 00 4c 89 ee 4c 89 e7 e8 44 eb 99 00 eb 93 <0f> 0b eb c6 41 be 51 00 00 00 eb 9e 66 0f 1f 44 00 00 0f 1f 44 00
Jul 08 13:18:58 kernel: RSP: 0018:ffff9aff054bfde8 EFLAGS: 00010206
Jul 08 13:18:58 kernel: RAX: 0000000000000003 RBX: ffff8dbef74e2000 RCX: ffff9aff054bfd88
Jul 08 13:18:58 kernel: RDX: 0000000000000087 RSI: 0000000000000246 RDI: 0000000000000246
Jul 08 13:18:58 kernel: RBP: ffff9aff054bfe10 R08: 0000000000000000 R09: 00000000000000cb
Jul 08 13:18:58 kernel: R10: ffff8dbf06e93110 R11: ffff8dbf0dd6c870 R12: ffff8dbeb7d60000
Jul 08 13:18:58 kernel: R13: ffff8dbef74e2000 R14: 0000000000000003 R15: ffff8dbef74e2510
Jul 08 13:18:58 kernel: FS:  00007f7e0e266740(0000) GS:ffff8dbf0dac0000(0000) knlGS:0000000000000000
Jul 08 13:18:58 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 08 13:18:58 kernel: CR2: 000055b3f5387098 CR3: 00000004424a6002 CR4: 00000000007606e0
Jul 08 13:18:58 kernel: PKRU: 55555554
Jul 08 13:18:58 kernel: Call Trace:
Jul 08 13:18:58 kernel:  nv_set_system_power_state+0x224/0x3c0 [nvidia]
Jul 08 13:18:58 kernel:  nv_procfs_write_suspend+0xe7/0x140 [nvidia]
Jul 08 13:18:58 kernel:  proc_reg_write+0x66/0x90
Jul 08 13:18:58 kernel:  vfs_write+0xc9/0x200
Jul 08 13:18:58 kernel:  ksys_write+0x67/0xe0
Jul 08 13:18:58 kernel:  __x64_sys_write+0x1a/0x20
Jul 08 13:18:58 kernel:  do_syscall_64+0x49/0xc0
Jul 08 13:18:58 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 08 13:18:58 kernel: RIP: 0033:0x7f7e0e37a1e7
Jul 08 13:18:58 kernel: Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Jul 08 13:18:58 kernel: RSP: 002b:00007fffd37b09c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul 08 13:18:58 kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f7e0e37a1e7
Jul 08 13:18:58 kernel: RDX: 0000000000000007 RSI: 0000561abd2ccb30 RDI: 0000000000000001
Jul 08 13:18:58 kernel: RBP: 0000561abd2ccb30 R08: 000000000000000a R09: 0000000000000006
Jul 08 13:18:58 kernel: R10: 0000561abc597017 R11: 0000000000000246 R12: 0000000000000007
Jul 08 13:18:58 kernel: R13: 00007f7e0e4556a0 R14: 00007f7e0e4564a0 R15: 00007f7e0e4558a0
Jul 08 13:18:58 kernel: ---[ end trace 2171d60572b69fb9 ]---
Jul 08 13:18:58 kernel: ------------[ cut here ]------------
Jul 08 13:18:58 kernel: WARNING: CPU: 3 PID: 3133 at /var/lib/dkms/nvidia/465.31/build/nvidia/nv.c:4104 nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
Jul 08 13:18:58 kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lib>
Jul 08 13:18:58 kernel:  crypto_simd cryptd snd_pcm glue_helper drm_kms_helper snd_seq_device rapl intel_cstate efi_pstore wmi_bmof intel_wmi_thunderbolt input_leds joydev snd_timer mxm_wmi cec snd rc_core ee1004 fb_sys_fops so>
Jul 08 13:18:58 kernel: CPU: 3 PID: 3133 Comm: nvidia-sleep.sh Tainted: P        W  OE     5.8.0-59-generic #66~20.04.1-Ubuntu
Jul 08 13:18:58 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C80/MAG Z490 TOMAHAWK (MS-7C80), BIOS 1.80 04/19/2021
Jul 08 13:18:58 kernel: RIP: 0010:nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
Jul 08 13:18:58 kernel: Code: 00 4d 85 e4 0f 84 4a ff ff ff 41 83 fd 02 74 e9 49 8b bc 24 78 02 00 00 be 02 00 00 00 e8 b7 d1 ff ff 85 c0 74 d3 0f 0b eb cf <0f> 0b e9 64 ff ff ff 48 c7 c7 70 49 26 c3 e8 9c d8 c0 da e8 d7 12
Jul 08 13:18:58 kernel: RSP: 0018:ffff9aff054bfe20 EFLAGS: 00010206
Jul 08 13:18:58 kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: 00000000000000bc
Jul 08 13:18:58 kernel: RDX: 00000000000000bb RSI: 1f704ba77eaa6627 RDI: 00002d3ff2010d40
Jul 08 13:18:58 kernel: RBP: ffff9aff054bfe50 R08: 0000000000000000 R09: 00000000000000cb
Jul 08 13:18:58 kernel: R10: ffff8dbf06e93110 R11: ffff8dbf0dd6c870 R12: ffff8dbef74e2000
Jul 08 13:18:58 kernel: R13: 0000000000000000 R14: ffff9aff054bfef0 R15: 0000561abd2ccb30
Jul 08 13:18:58 kernel: FS:  00007f7e0e266740(0000) GS:ffff8dbf0dac0000(0000) knlGS:0000000000000000
Jul 08 13:18:58 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 08 13:18:58 kernel: CR2: 000055b3f5387098 CR3: 00000004424a6002 CR4: 00000000007606e0
Jul 08 13:18:58 kernel: PKRU: 55555554
Jul 08 13:18:58 kernel: Call Trace:
Jul 08 13:18:58 kernel:  nv_procfs_write_suspend+0xe7/0x140 [nvidia]
Jul 08 13:18:58 kernel:  proc_reg_write+0x66/0x90
Jul 08 13:18:58 kernel:  vfs_write+0xc9/0x200
Jul 08 13:18:58 kernel:  ksys_write+0x67/0xe0
Jul 08 13:18:58 kernel:  __x64_sys_write+0x1a/0x20
Jul 08 13:18:58 kernel:  do_syscall_64+0x49/0xc0
Jul 08 13:18:58 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 08 13:18:58 kernel: RIP: 0033:0x7f7e0e37a1e7
Jul 08 13:18:58 kernel: Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Jul 08 13:18:58 kernel: RSP: 002b:00007fffd37b09c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul 08 13:18:58 kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f7e0e37a1e7
Jul 08 13:18:58 kernel: RDX: 0000000000000007 RSI: 0000561abd2ccb30 RDI: 0000000000000001
Jul 08 13:18:58 kernel: RBP: 0000561abd2ccb30 R08: 000000000000000a R09: 0000000000000006
Jul 08 13:18:58 kernel: R10: 0000561abc597017 R11: 0000000000000246 R12: 0000000000000007
Jul 08 13:18:58 kernel: R13: 00007f7e0e4556a0 R14: 00007f7e0e4564a0 R15: 00007f7e0e4558a0
Jul 08 13:18:58 kernel: ---[ end trace 2171d60572b69fba ]---
Jul 08 13:19:01 kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Jul 08 13:19:02 kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jul 08 13:19:03 kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:428

The call trace looks a little different, and it doesn't crash every time; it happens seemingly at random. The solution @generix mentioned in the other thread doesn't work for me, and the solution found by @thesourcehim makes it freeze on every resume (it froze every time across about 6 runs).
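Because the warnings only show up on some resumes, it can help to pull just the nvidia warning sites out of the previous boot's kernel log and count how often each one fires. A minimal sketch: the function name is hypothetical, and it only matches the `WARNING: CPU: N PID: N at FILE:LINE FUNCTION [nvidia]` format seen in the log above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and count how often each
# nvidia warning site (file:line plus function) appears.
summarize_nvidia_warnings() {
    # Keep only WARNING lines raised inside the nvidia module, then strip
    # everything up to " at " so identical sites group together.
    grep 'WARNING: CPU:.*\[nvidia\]' |
        sed 's/.* at //' |
        sort | uniq -c | sort -rn
}

# Example: journalctl -k -b -1 --no-pager | summarize_nvidia_warnings
```

`--no-pager` makes journalctl print full lines instead of cutting them off at the terminal width, as happened with the "Modules linked in" lines above.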

System:
OS: Ubuntu 20.04
Kernel: 5.8.0-59
Nvidia: nvidia 465.27
GPU: GTX 980 Ti
CPU: i7-10700K

thesourcehim · July 15, 2021, 6:32am

@npissoawsome my solution is specific to my laptop model. Even if you have a similar issue, the faulty device could be at a different PCI address; use lspci to check.
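Besides reading through `lspci -D` output by eye, the NVIDIA audio function can be located from sysfs alone: its `vendor` file reads `0x10de` and its `class` file starts with `0x0403` (multimedia class, audio device subclass). The helper name and the `SYSFS_PCI` override below are assumptions for illustration, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: list PCI functions whose vendor is NVIDIA (0x10de) and
# whose class is an audio device (0x0403xx), using sysfs only.
# SYSFS_PCI is an assumed override for testing; it defaults to the real path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

find_nvidia_audio() {
    for dev in "$SYSFS_PCI"/*; do
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        vendor=$(cat "$dev/vendor")
        class=$(cat "$dev/class")
        # class is printed as 0xCCSSPP (class, subclass, prog-if)
        case "$class" in
            0x0403*) ;;
            *) continue ;;
        esac
        if [ "$vendor" = "0x10de" ]; then
            basename "$dev"
        fi
    done
}
```

The printed address (for example `0000:01:00.1`) is the value that goes into both `KERNEL==` and the `/sys/bus/pci/devices/<address>/remove` path of a udev rule like the one earlier in this thread.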

npissoawsome · July 16, 2021, 12:27am

@thesourcehim I checked lspci; my card was on the same PCI slot as yours, actually, haha.

npissoawsome · August 18, 2021, 1:29pm

Enabling persistence mode with nvidia-smi -pm ENABLED fixes this issue for me.
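Persistence mode set with `nvidia-smi -pm` is lost when the driver is reloaded (for example after a reboot), so it can be worth checking it before suspending. `nvidia-smi --query-gpu=persistence_mode --format=csv,noheader` prints `Enabled` or `Disabled` for each GPU. The wrapper below is a hypothetical sketch; `NVIDIA_SMI` is an assumed override so the logic can be tried without a GPU, and the enable step needs root. NVIDIA's driver documentation also describes the `nvidia-persistenced` daemon as the preferred way to keep the GPU initialized.

```shell
#!/bin/sh
# Hypothetical wrapper: enable persistence mode only if some GPU reports it
# as disabled. NVIDIA_SMI is an assumed override for testing.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

ensure_persistence_mode() {
    # One line per GPU: "Enabled" or "Disabled"
    states=$("$NVIDIA_SMI" --query-gpu=persistence_mode --format=csv,noheader)
    case "$states" in
        *Disabled*)
            echo "persistence mode disabled; enabling"
            "$NVIDIA_SMI" -pm ENABLED ;;
        *)
            echo "persistence mode already enabled" ;;
    esac
}

# Example (as root): ensure_persistence_mode
```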