Kernel 5.6: system freeze when resuming from suspend or hibernate

thesourcehim · April 30, 2020, 6:42pm

Starting with kernel 5.6, my ASUS N56VB laptop (GeForce GT 740M) cannot resume from suspend or hibernate when the nvidia driver (440.82) is installed. If I suspend from the desktop, the screen goes black on resume and no input works. If I suspend from another tty, I can see the previous text on resume but cannot type anything. I can switch to another tty using Ctrl+Alt+Fn, but the other ttys are also frozen, and that’s the only key combination that works; even Ctrl+Alt+Del doesn’t. The only option is a hard reset. Neither journalctl nor inspecting /var/log/messages directly yields anything useful; there are no new lines after:
kernel: PM: suspend entry (deep)
kernel: Filesystems sync: 0.145 seconds
This does not happen in kernel 5.5.18, only in 5.6 (tried 5.6.6, 5.6.7, 5.6.8).
OS: Fedora 31 x86_64

thesourcehim · October 9, 2020, 7:08pm

After configuring the nvidia power management services, I managed to get a log from journalctl:

Oct 09 21:56:57 kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [nvidia-sleep.sh:6044]
Oct 09 21:56:57 kernel: Modules linked in: snd_seq_dummy snd_hrtimer rfcomm ccm nf_nat_tftp nf_nat_ftp nft_masq vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nf_conntrack_sane nf_conntrack_>
Oct 09 21:56:57 kernel: at24 iTCO_wdt intel_pmc_bxt iTCO_vendor_support snd_hda_codec_hdmi kvm_intel mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm snd_hda_intel snd_>
Oct 09 21:56:57 kernel: CPU: 6 PID: 6044 Comm: nvidia-sleep.sh Tainted: P OEL 5.8.14-200.fc32.x86_64 #1
Oct 09 21:56:57 kernel: Hardware name: ASUSTeK COMPUTER INC. N56VB/N56VB, BIOS N56VB.202 01/21/2013
Oct 09 21:56:57 kernel: RIP: 0010:_nv030848rm+0x7b/0xf0 [nvidia]
Oct 09 21:56:57 kernel: Code: 29 eb 44 39 e3 41 0f 47 dc 45 31 ed 41 89 df 4c 89 f9 4c 89 fe e8 05 b2 6f 00 4c 8b 4d 00 4d 01 f9 41 29 dc 75 c9 83 7d 0c 02 <74> 63 5b b8 01 00 00 00 41 >
Oct 09 21:56:57 kernel: RSP: 0018:ffffb0ce4152fb18 EFLAGS: 00000297
Oct 09 21:56:57 kernel: RAX: ffffb0ce40cf4a68 RBX: 0000000000000014 RCX: 0000000000000000
Oct 09 21:56:57 kernel: RDX: 0000000000000014 RSI: ffff8e5101815924 RDI: ffffb0ce40cf4a7c
Oct 09 21:56:57 kernel: RBP: ffff8e51018158c0 R08: 0000000000000000 R09: ffff8e5101815924
Oct 09 21:56:57 kernel: R10: 000000000000179c R11: ffff8e5116065808 R12: 0000000000000000
Oct 09 21:56:57 kernel: R13: 0000000000000000 R14: ffffb0ce40ce8008 R15: 0000000000000014
Oct 09 21:56:57 kernel: FS: 00007f8896420740(0000) GS:ffff8e511f780000(0000) knlGS:0000000000000000
Oct 09 21:56:57 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 09 21:56:57 kernel: CR2: 00005615e7193008 CR3: 00000003b40e4001 CR4: 00000000001606e0
Oct 09 21:56:57 kernel: Call Trace:
Oct 09 21:56:57 kernel: ? _nv030850rm+0x3a/0xc0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv030847rm+0x223/0x440 [nvidia]
Oct 09 21:56:57 kernel: ? _nv030844rm+0x39/0x50 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021052rm+0x10d/0x1a0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021009rm+0x135/0x190 [nvidia]
Oct 09 21:56:57 kernel: ? _nv039024rm+0x74/0x120 [nvidia]
Oct 09 21:56:57 kernel: ? _nv038996rm+0x19d/0x280 [nvidia]
Oct 09 21:56:57 kernel: ? _nv006916rm+0x2960/0x2d10 [nvidia]
Oct 09 21:56:57 kernel: ? _nv000236rm+0x721/0xc90 [nvidia]
Oct 09 21:56:57 kernel: ? _nv020938rm+0x69f/0x790 [nvidia]
Oct 09 21:56:57 kernel: ? _nv021046rm+0x4b/0xd0 [nvidia]
Oct 09 21:56:57 kernel: ? _nv000723rm+0x26c/0x2e0 [nvidia]
Oct 09 21:56:57 kernel: ? rm_power_management+0x1cd/0x200 [nvidia]
Oct 09 21:56:57 kernel: ? kmem_cache_alloc+0x70/0x220
Oct 09 21:56:57 kernel: ? nv_power_management+0xea/0x130 [nvidia]
Oct 09 21:56:57 kernel: ? nvidia_resume.isra.0+0x57/0x70 [nvidia]
Oct 09 21:56:57 kernel: ? nv_set_system_power_state+0x2b8/0x3c0 [nvidia]
Oct 09 21:56:57 kernel: ? nv_procfs_write_suspend+0xec/0x140 [nvidia]
Oct 09 21:56:57 kernel: ? proc_reg_write+0x51/0x90
Oct 09 21:56:57 kernel: ? vfs_write+0xc7/0x1f0
Oct 09 21:56:57 kernel: ? ksys_write+0x4f/0xc0
Oct 09 21:56:57 kernel: ? do_syscall_64+0x4d/0x90
Oct 09 21:56:57 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 09 21:57:06 kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 6-… } 64017 jiffies s: 493 root: 0x40/.
Oct 09 21:57:06 kernel: rcu: blocking rcu_node structures:
Oct 09 21:57:06 kernel: Task dump for CPU 6:
Oct 09 21:57:06 kernel: nvidia-sleep.sh R running task 0 6044 1 0x00004008
Oct 09 21:57:06 kernel: Call Trace:
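
For anyone who needs to pull entries like this out of a longer journal dump, here is a minimal Python sketch. It assumes only the watchdog message format shown in the log above; the function name and the dictionary keys are made up for illustration and are not part of any NVIDIA or systemd tooling.

```python
import re

# Matches the watchdog message format seen in the log above, e.g.
# "soft lockup - CPU#6 stuck for 23s! [nvidia-sleep.sh:6044]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)

def find_soft_lockups(log_text):
    """Return one dict per soft-lockup line found in log_text (hypothetical helper)."""
    hits = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append({
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return hits

# Two lines copied from the log above; only the first is a lockup report.
sample = (
    "Oct 09 21:56:57 kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [nvidia-sleep.sh:6044]\n"
    "Oct 09 21:56:57 kernel: CPU: 6 PID: 6044 Comm: nvidia-sleep.sh Tainted: P OEL 5.8.14-200.fc32.x86_64 #1\n"
)
lockups = find_soft_lockups(sample)
print(lockups)
```

The command name in the brackets is the task that was on the CPU when the watchdog fired, which is how the trace points at nvidia-sleep.sh.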

thesourcehim · October 30, 2020, 12:43pm

Tested again with kernel 5.8.17, because I saw that a PCI bug related to nvidia was fixed in that kernel: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.8.17 (search for commit b6bd62dc59e7af25a0b1af432a553c0f504ce64c). However, the problem persists with the same trace log.

thesourcehim · January 13, 2021, 9:16am

Driver 460.32.03, kernel 5.10.6, problem is still here, card does not wake up. Also tried to write ‘resume’ to /proc/driver/nvidia/suspend, no luck, the command just hangs forever.

generix · January 13, 2021, 1:48pm

Did you already try setting kernel parameter
acpi_osi=! acpi_osi="Windows 2009"
?

thesourcehim · January 13, 2021, 4:34pm

Yes, tried that, no difference.

RealNC · January 15, 2021, 11:46am

Maybe there are systemd services that need to be enabled? At least that’s the case here:

systemctl status nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

thesourcehim · January 15, 2021, 12:46pm

● nvidia-hibernate.service - NVIDIA system hibernate actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-hibernate.service; enabled; vendor preset: disabled)
Active: inactive (dead)

● nvidia-resume.service - NVIDIA system resume actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-resume.service; enabled; vendor preset: disabled)
Active: inactive (dead)

● nvidia-suspend.service - NVIDIA system suspend actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend.service; enabled; vendor preset: disabled)
Active: inactive (dead)

RealNC · January 21, 2021, 3:40pm

Yeah, enable those services:

systemctl enable nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

Now reboot (because I’m not sure if just starting them manually with systemctl start is enough here.) Maybe this will fix the issue.

thesourcehim · January 21, 2021, 3:58pm

As I posted before they are already enabled.

RealNC · January 21, 2021, 4:54pm

What you posted shows that they are not enabled. (Active: inactive (dead).) “Loaded” does not mean the service is enabled. It just means systemd has seen the service file and loaded it, not that it actually started it.

Enable these services and reboot. The status command will then look very different.

thesourcehim · January 21, 2021, 5:48pm

I did what was suggested (again) and the output of the status command is exactly the same. Look at the word ‘enabled’ after the service file name. This indicates that the service is indeed enabled. As for why it’s not active? Well, looking at the service file:

[Unit]
Description=NVIDIA system suspend actions
Before=systemd-suspend.service

[Service]
Type=oneshot
ExecStart=/usr/bin/logger -t suspend -s "nvidia-suspend.service"
ExecStart=/usr/bin/nvidia-sleep.sh "suspend"

[Install]
RequiredBy=systemd-suspend.service

It seems that it runs when system goes to specific state (suspend in this example) and it is oneshot, so why would it be active all the time?
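
To make the enabled-versus-active distinction concrete, here is a rough Python sketch that reads the two fields separately from `systemctl status` text. It assumes the output layout pasted earlier in this thread (a "●" header line, then "Loaded:" and "Active:" lines); other systemd versions may print a different marker, and the function name is hypothetical.

```python
import re

def parse_unit_status(status_text):
    """Map each unit name to its enablement state and its active state."""
    units = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("●"):
            # Header line: "● name.service - description"
            current = line.lstrip("● ").split(" ", 1)[0]
            units[current] = {"enabled": None, "active": None}
        elif current and line.startswith("Loaded:"):
            # Enablement is the second ';'-separated field inside the parentheses.
            m = re.search(r"\(([^)]*)\)", line)
            if m:
                fields = [f.strip() for f in m.group(1).split(";")]
                if len(fields) > 1:
                    units[current]["enabled"] = fields[1]
        elif current and line.startswith("Active:"):
            # "Active: inactive (dead)" -> "inactive"
            units[current]["active"] = line.split()[1]
    return units

# Status text copied from the post above.
sample = """● nvidia-suspend.service - NVIDIA system suspend actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend.service; enabled; vendor preset: disabled)
Active: inactive (dead)
"""
status = parse_unit_status(sample)
print(status)
```

A oneshot unit that has finished its work (or has not been triggered yet) can report "inactive" while still being "enabled", which matches the output above.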

aplattner · January 21, 2021, 6:38pm

Yeah, you’re right – these services are launched before and after systemd actually suspends the system in order to give the kernel driver an opportunity to save vidmem contents before the GPU is powered off. Your very first post shows that this mechanism was working because the kernel printed Comm: nvidia-sleep.sh, which is the script that is launched by the systemd units.

Can you please run nvidia-bug-report.sh after reproducing the soft lockup problem with the latest driver and attach the nvidia-bug-report.log.gz file here?

thesourcehim · January 22, 2021, 6:06am

I reproduced the problem and quickly switched to another tty to run the script. The script hung during execution and ‘cat’ subprocess was eating 100% cpu. I suppose it tried to read something from nvidia device file. I tried to execute the script with --safe-mode and --extra-system-data parameters, but it still hangs. Attaching resulting file, not sure if it has enough data. I’m not an expert here, but it looks like an infinite loop of sorts inside the kernel module (as nvidia-sleep.sh also consumes 100% cpu).
nvidia-bug-report.log.gz (1.1 KB)

thesourcehim · April 13, 2021, 7:06am

I reproduced the problem with kernel 5.11.13 with acpi_osi=! acpi_osi="Windows 2009" set. Attaching dmesg output that contains a stack trace, plus an nvidia-bug-report log (it was generated under normal conditions though, i.e. before the freeze, as I cannot get a full log once the freeze occurs).
dmesg.txt (119.6 KB)
nvidia-bug-report.log.gz (1.1 MB)

thesourcehim · April 16, 2021, 5:37pm

465.24.02 - same thing happening

thesourcehim · May 3, 2021, 2:38pm

I tried an older kernel (5.3.7), but the problem persists even there (465.27 driver). Looks like I was wrong initially and it is not related to the 5.5-5.6 transition.

generix · May 3, 2021, 3:01pm

Looking at the logs, it seems you tried to enable dynamic power management on a platform that doesn’t support it. That shouldn’t matter, but in your case the nvidia audio device can’t be turned on anymore (did you use the udev rules to set runpm to auto?). I don’t know if this has an influence, but you should remove that.

thesourcehim · May 3, 2021, 3:06pm

I didn’t place any udev rules manually, but maybe some package did. Can you hint at what I should look for? Specific rule contents? Also, I’ll remove NVreg_DynamicPowerManagement from my kernel cmdline, thanks.
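
One way to confirm the option is really gone after a reboot is to look at the kernel command line. The Python sketch below is only an illustration: the helper name is made up, and the sample command line (including the 0x02 value and the root device) is invented for the example rather than taken from this machine.

```python
import shlex

def find_params(cmdline, needles):
    """Return the kernel command-line tokens that contain any of the needles."""
    # shlex keeps quoted values such as acpi_osi="Windows 2009" in one token.
    tokens = shlex.split(cmdline)
    return [tok for tok in tokens if any(n in tok for n in needles)]

# Hypothetical command line, for illustration only.
sample_cmdline = (
    'BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro acpi_osi=! '
    'acpi_osi="Windows 2009" nvidia.NVreg_DynamicPowerManagement=0x02'
)
nvreg = find_params(sample_cmdline, ["NVreg_"])
osi = find_params(sample_cmdline, ["acpi_osi"])
print(nvreg)
print(osi)
```

On a running system the input would come from /proc/cmdline, e.g. `find_params(open("/proc/cmdline").read(), ["NVreg_"])`; an empty list means no NVreg option is set there, although it could still be set in a modprobe configuration file.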

generix · May 3, 2021, 3:08pm

Please see this:
https://forums.developer.nvidia.com/t/no-option-for-audio-over-displayport-hdmi/175889/2?u=generix
