Kernel 5.6: system freeze when resuming from suspend or hibernate

Starting with kernel 5.6 on my ASUS N56VB laptop (GeForce GT 740M), I cannot resume from suspend or hibernate if the nvidia driver (440.82) is installed. If I suspend from the desktop, the screen stays black on resume and no input works. If I suspend from another tty, then upon resume I can see the previous text but cannot type anything. I can switch to another tty using Ctrl+Alt+Fn, but the other ttys are also frozen, and that is the only key combination that works; even Ctrl+Alt+Del doesn’t. The only option is a hard reset. Neither journalctl nor inspecting /var/log/messages directly yields anything useful; there are no new lines after:
kernel: PM: suspend entry (deep)
kernel: Filesystems sync: 0.145 seconds
This does not happen in kernel 5.5.18, only in 5.6 (tried 5.6.6, 5.6.7, 5.6.8).
OS: Fedora 31 x86_64
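
In case it helps others collect the same data: after a hard reset, the messages from the frozen session belong to the previous boot. Below is a rough sketch, assuming the journal is persistent (otherwise nothing from the previous boot survives). `resume_markers` is a hypothetical helper name, and the grep pattern is only an assumption about which lines are worth looking for.

```shell
#!/bin/sh
# Keep only kernel log lines that mark suspend/resume progress or a lockup.
# resume_markers is a hypothetical helper; the pattern list is an assumption.
resume_markers() {
    grep -E 'PM: suspend (entry|exit)|soft lockup|rcu: INFO'
}

# Usage (kernel messages of the previous boot, needs a persistent journal):
#   journalctl -k -b -1 --no-pager | resume_markers
```

If the last line that survives is the suspend entry, as in the excerpt above, the freeze most likely happened before anything else could be flushed to disk.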

After configuring the nvidia power management services, I managed to get some log output from journalctl:

Oct 09 21:56:57 kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [nvidia-sleep.sh:6044]
Oct 09 21:56:57 kernel: Modules linked in: snd_seq_dummy snd_hrtimer rfcomm ccm nf_nat_tftp nf_nat_ftp nft_masq vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nf_conntrack_sane nf_conntrack_>
Oct 09 21:56:57 kernel: at24 iTCO_wdt intel_pmc_bxt iTCO_vendor_support snd_hda_codec_hdmi kvm_intel mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm snd_hda_intel snd_>
Oct 09 21:56:57 kernel: CPU: 6 PID: 6044 Comm: nvidia-sleep.sh Tainted: P OEL 5.8.14-200.fc32.x86_64 #1
Oct 09 21:56:57 kernel: Hardware name: ASUSTeK COMPUTER INC. N56VB/N56VB, BIOS N56VB.202 01/21/2013
Oct 09 21:56:57 kernel: RIP: 0010:_nv030848rm+0x7b/0xf0 [nvidia]
Oct 09 21:56:57 kernel: Code: 29 eb 44 39 e3 41 0f 47 dc 45 31 ed 41 89 df 4c 89 f9 4c 89 fe e8 05 b2 6f 00 4c 8b 4d 00 4d 01 f9 41 29 dc 75 c9 83 7d 0c 02 <74> 63 5b b8 01 00 00 00 41 >
Oct 09 21:56:57 kernel: RSP: 0018:ffffb0ce4152fb18 EFLAGS: 00000297
Oct 09 21:56:57 kernel: RAX: ffffb0ce40cf4a68 RBX: 0000000000000014 RCX: 0000000000000000
Oct 09 21:56:57 kernel: RDX: 0000000000000014 RSI: ffff8e5101815924 RDI: ffffb0ce40cf4a7c
Oct 09 21:56:57 kernel: RBP: ffff8e51018158c0 R08: 0000000000000000 R09: ffff8e5101815924
Oct 09 21:56:57 kernel: R10: 000000000000179c R11: ffff8e5116065808 R12: 0000000000000000
Oct 09 21:56:57 kernel: R13: 0000000000000000 R14: ffffb0ce40ce8008 R15: 0000000000000014
Oct 09 21:56:57 kernel: FS: 00007f8896420740(0000) GS:ffff8e511f780000(0000) knlGS:0000000000000000
Oct 09 21:56:57 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 09 21:56:57 kernel: CR2: 00005615e7193008 CR3: 00000003b40e4001 CR4: 00000000001606e0
Oct 09 21:56:57 kernel: Call Trace:
Oct 09 21:56:57 kernel: ? _nv030850rm+0x3a/0xc0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv030847rm+0x223/0x440 [nvidia]
Oct 09 21:56:57 kernel: ? _nv030844rm+0x39/0x50 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021052rm+0x10d/0x1a0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021009rm+0x135/0x190 [nvidia]
Oct 09 21:56:57 kernel: ? _nv039024rm+0x74/0x120 [nvidia]
Oct 09 21:56:57 kernel: ? _nv038996rm+0x19d/0x280 [nvidia]
Oct 09 21:56:57 kernel: ? _nv006916rm+0x2960/0x2d10 [nvidia]
Oct 09 21:56:57 kernel: ? _nv000236rm+0x721/0xc90 [nvidia]
Oct 09 21:56:57 kernel: ? _nv020938rm+0x69f/0x790 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021046rm+0x4b/0xd0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv000723rm+0x26c/0x2e0 [nvidia]
Oct 09 21:56:57 kernel: ? rm_power_management+0x1cd/0x200 [nvidia]
Oct 09 21:56:57 kernel: ? kmem_cache_alloc+0x70/0x220
Oct 09 21:56:57 kernel: ? nv_power_management+0xea/0x130 [nvidia]
Oct 09 21:56:57 kernel: ? nvidia_resume.isra.0+0x57/0x70 [nvidia]
Oct 09 21:56:57 kernel: ? nv_set_system_power_state+0x2b8/0x3c0 [nvidia]
Oct 09 21:56:57 kernel: ? nv_procfs_write_suspend+0xec/0x140 [nvidia]
Oct 09 21:56:57 kernel: ? proc_reg_write+0x51/0x90
Oct 09 21:56:57 kernel: ? vfs_write+0xc7/0x1f0
Oct 09 21:56:57 kernel: ? ksys_write+0x4f/0xc0
Oct 09 21:56:57 kernel: ? do_syscall_64+0x4d/0x90
Oct 09 21:56:57 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 09 21:57:06 kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 6-… } 64017 jiffies s: 493 root: 0x40/.
Oct 09 21:57:06 kernel: rcu: blocking rcu_node structures:
Oct 09 21:57:06 kernel: Task dump for CPU 6:
Oct 09 21:57:06 kernel: nvidia-sleep.sh R running task 0 6044 1 0x00004008
Oct 09 21:57:06 kernel: Call Trace:

Tested again with kernel 5.8.17, because I saw that a PCI bug related to nvidia was fixed in that kernel: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.8.17 (search for commit b6bd62dc59e7af25a0b1af432a553c0f504ce64c). However, the problem persists with the same trace.

With driver 460.32.03 and kernel 5.10.6 the problem is still here: the card does not wake up. I also tried writing ‘resume’ to /proc/driver/nvidia/suspend, with no luck; the command just hangs forever.

Did you already try setting kernel parameter
acpi_osi=! acpi_osi="Windows 2009"
?

Yes, tried that, no difference.

Maybe there are systemd services that need to be enabled? At least that’s the case here:

systemctl status nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

● nvidia-hibernate.service - NVIDIA system hibernate actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-hibernate.service; enabled; vendor preset: disabled)
Active: inactive (dead)

● nvidia-resume.service - NVIDIA system resume actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-resume.service; enabled; vendor preset: disabled)
Active: inactive (dead)

● nvidia-suspend.service - NVIDIA system suspend actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend.service; enabled; vendor preset: disabled)
Active: inactive (dead)

Yeah, enable those services:

systemctl enable nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

Now reboot (I’m not sure whether just starting them manually with systemctl start is enough here). Maybe this will fix the issue.

As I posted before, they are already enabled.

What you posted shows that they are not enabled. (Active: inactive (dead).) “Loaded” does not mean the service is enabled. It just means systemd has seen the service file and loaded it, not that it actually started it.

Enable these services and reboot. The status command will then look very different.

I did what you suggested (again), and the output of the status command is exactly the same. Look at the word ‘enabled’ after the service file name. This indicates that the service is indeed enabled. As for why it’s not active, look at the service file:

[Unit]
Description=NVIDIA system suspend actions
Before=systemd-suspend.service

[Service]
Type=oneshot
ExecStart=/usr/bin/logger -t suspend -s "nvidia-suspend.service"
ExecStart=/usr/bin/nvidia-sleep.sh "suspend"

[Install]
RequiredBy=systemd-suspend.service

It seems that it runs when the system enters a specific state (suspend in this example), and it is a oneshot unit, so why would it be active all the time?
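
For what it’s worth, the two states can also be queried separately instead of being read off the status output. Below is a minimal sketch, assuming the unit names from this thread. `report_units` is a hypothetical helper name; `systemctl is-enabled` and `systemctl is-active` are the standard subcommands.

```shell
#!/bin/sh
# Print the enablement state and the runtime state of each unit side by side.
# report_units is a hypothetical helper name used only for this illustration.
report_units() {
    for unit in "$@"; do
        # is-enabled: has the unit's [Install] section been applied?
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
        # is-active: is the unit running right now? A oneshot unit is
        # normally "inactive" except while its commands are executing.
        active=$(systemctl is-active "$unit" 2>/dev/null)
        printf '%s: enablement=%s active=%s\n' \
            "$unit" "${enabled:-unknown}" "${active:-unknown}"
    done
}

report_units nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service
```

An enabled oneshot unit that reports inactive between suspend cycles would be consistent with the unit file above.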

Yeah, you’re right – these services are launched before and after systemd actually suspends the system in order to give the kernel driver an opportunity to save vidmem contents before the GPU is powered off. Your very first post shows that this mechanism was working because the kernel printed Comm: nvidia-sleep.sh, which is the script that is launched by the systemd units.

Can you please run nvidia-bug-report.sh after reproducing the soft lockup problem with the latest driver and attach the nvidia-bug-report.log.gz file here?

I reproduced the problem and quickly switched to another tty to run the script. The script hung during execution, and a ‘cat’ subprocess was using 100% CPU. I suppose it tried to read something from an nvidia device file. I tried running the script with the --safe-mode and --extra-system-data parameters, but it still hangs. I’m attaching the resulting file, though I’m not sure it has enough data. I’m not an expert here, but it looks like an infinite loop of sorts inside the kernel module (nvidia-sleep.sh also consumes 100% CPU).
nvidia-bug-report.log.gz (1.1 KB)

I reproduced the problem with kernel 5.11.13 with acpi_osi=! acpi_osi="Windows 2009" set. I’m attaching dmesg output that contains a stack trace, plus an nvidia-bug-report log (generated under normal conditions, i.e. before the freeze, because I cannot get a full log once the freeze occurs).
dmesg.txt (119.6 KB)
nvidia-bug-report.log.gz (1.1 MB)

465.24.02 - same thing happening

I tried an older kernel (5.3.7), but the problem persists even there (465.27 driver). It looks like I was wrong initially and the problem is not related to the 5.5 to 5.6 transition.

Looking at the logs, it seems you tried to enable dynamic power management on a platform that doesn’t support it. That shouldn’t matter, but in your case the nvidia audio device can’t be turned on anymore (did you use the udev rules to set runpm to auto?). I don’t know whether this has an influence, but you should remove that.

I didn’t place any udev rules manually, but maybe some package did. Can you give me a hint about what I should look for? Specific rule contents? Also, I’ll remove NVreg_DynamicPowerManagement from my kernel cmdline, thanks.

Please see this:
https://forums.developer.nvidia.com/t/no-option-for-audio-over-displayport-hdmi/175889/2?u=generix
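
Independently of the linked post, here is one rough way to search for such a rule and for the module option. It is a sketch under assumptions: the directories are the usual udev rule locations on Fedora, the match on `power/control` together with the NVIDIA PCI vendor ID `0x10de` is only a heuristic, and `find_runpm_rules` and `find_dpm_option` are hypothetical helper names.

```shell
#!/bin/sh
# List udev rule lines that set power/control and mention vendor 0x10de.
# find_runpm_rules is a hypothetical helper; the match is a heuristic only.
find_runpm_rules() {
    for dir in "$@"; do
        [ -d "$dir" ] && grep -rn 'power/control' "$dir"
    done | grep -i '0x10de'
}

# Show any NVreg_DynamicPowerManagement setting in a kernel command line file.
find_dpm_option() {
    grep -o 'NVreg_DynamicPowerManagement=[^ ]*' "${1:-/proc/cmdline}"
}

find_runpm_rules /etc/udev/rules.d /usr/lib/udev/rules.d \
    || echo "no matching udev rules found"
find_dpm_option /proc/cmdline \
    || echo "NVreg_DynamicPowerManagement is not on the kernel command line"
```

The option may also be set in a file under /etc/modprobe.d rather than on the kernel command line, so an empty result from the second check does not prove it is unset.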