[FIXED] Suspend / Resume issues with the driver version 470

It seems that driver version 470 causes a kernel panic upon resuming from suspend: the display gets no input signal, and the keyboard stops responding a few seconds after resuming (when pressing NumLock, the status lights on the keyboard do not change!).

I have temporarily downgraded to version 460 until this gets fixed. Logs for both the 470 and 460 (fully working) versions are attached below.

nvidia-bug-report.log.470.gz (203.8 KB)

nvidia-bug-report.log.460.gz (234.5 KB)

Machine specs:
OS: Kubuntu 21.04
Kernel: Mainline 5.13.12 (also tried with stock 5.11.xx from Ubuntu)
GPU: GeForce GT 710

Hi. You are not the only one having this problem. I hope the developers can solve it soon.

Good news for affected users! I found a fix!

A LITTLE BACKGROUND
You may already know that NVIDIA drivers on Linux rely on one of two power management methods (as described here):

  1. Kernel Driver Callback: Works out of the box with no configuration required, but lacks advanced power management features and preserves only a portion of the video memory.

  2. systemd (/proc/driver/nvidia/suspend): Provides advanced power management features and preserves complete video memory, but requires configuration and setup.
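
A rough way to see which of the two modes a loaded driver is configured for is to look at its parameter dump. The sketch below assumes the driver exposes a `PreserveVideoMemoryAllocations` line in /proc/driver/nvidia/params; the function name `nvidia_pm_method` is made up for illustration, and the result is only one indicator, since the systemd units can be enabled independently of this parameter.

```shell
# Hypothetical helper: guess the power management mode from the driver's
# parameter dump. The default path is an assumption about the running system.
nvidia_pm_method() {
    params_file="${1:-/proc/driver/nvidia/params}"
    if [ ! -r "$params_file" ]; then
        echo "unknown"
        return 0
    fi
    # Lines look like "Name: value"; pick the one we care about.
    value=$(awk -F': *' '$1 == "PreserveVideoMemoryAllocations" { print $2; exit }' "$params_file")
    case "$value" in
        1) echo "systemd (/proc/driver/nvidia/suspend)" ;;
        0) echo "kernel driver callback" ;;
        *) echo "unknown" ;;
    esac
}
```

Calling `nvidia_pm_method` with no argument reads the live driver; passing a file path lets you try it on a saved copy of the parameter dump.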

THE CAUSE
Having mentioned the above, upon further inspection I found out that the 470 driver switched to the systemd method, while previous versions relied on the Kernel Driver Callback. Apparently this is broken on some setups and kernels.

THE WORKAROUND
So, as long as the systemd method is broken, we have to revert to the Kernel Driver Callback method. Here’s how you can do that:

  • Disable NVIDIA systemd services
sudo systemctl stop nvidia-suspend.service
sudo systemctl stop nvidia-hibernate.service
sudo systemctl stop nvidia-resume.service

sudo systemctl disable nvidia-suspend.service
sudo systemctl disable nvidia-hibernate.service
sudo systemctl disable nvidia-resume.service
  • Remove NVIDIA systemd script
sudo rm /lib/systemd/system-sleep/nvidia

Reboot and you should be able to suspend and resume properly with driver version 470.xx.

NOTE: Back up your configuration just in case, or downgrade the driver if this does not work on your setup. This was tested on Kubuntu 21.04 with a GeForce GT 710.
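
If you would rather keep the sleep hook around than delete it, a minimal sketch of a reversible variant could look like this. The helper name `backup_and_remove` and the backup directory are hypothetical choices, not part of the driver packaging; the function only moves a file aside so it can be put back later.

```shell
# Hypothetical helper: move a file into a backup directory instead of
# deleting it, so the workaround can be undone.
#   $1 = file to remove, $2 = directory to keep the copy in
backup_and_remove() {
    src="$1"
    backup_dir="$2"
    if [ ! -e "$src" ]; then
        echo "nothing to do: $src not found"
        return 0
    fi
    mkdir -p "$backup_dir" && mv "$src" "$backup_dir/"
}
```

From a root shell this would be called as `backup_and_remove /lib/systemd/system-sleep/nvidia /root/nvidia-sleep-backup` in place of the `rm` step. To undo the workaround, move the file back and run `systemctl enable` on the three units again.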

I face the same issue. My laptop does wake up, but the primary monitor is so dark that it is effectively unusable. The secondary monitor is OK.

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 21.04
Release:	21.04
Codename:	hirsute
$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.63.01  Tue Aug  3 20:44:16 UTC 2021
GCC version:
$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd000025B8sv000017AAsd000022DEbc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-470 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

dmesg:

[  130.074778] RIP: 0010:nv_drm_master_set+0x27/0x30 [nvidia_drm]
[  130.074783] Code: 90 b5 df 0f 1f 44 00 00 55 48 8b 47 48 48 8b 78 20 48 8b 05 bb 6c 00 00 48 89 e5 48 8b 40 28 e8 ef ef f1 df 84 c0 74 02 5d c3 <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56
[  130.074785] RSP: 0018:ffff9ee947133b80 EFLAGS: 00010246
[  130.074788] RAX: 0000000000000000 RBX: ffff8c4c5d289200 RCX: 0000000000000008
[  130.074789] RDX: ffffffffc3a3ced8 RSI: 0000000000000292 RDI: ffffffffc3a3cea0
[  130.074791] RBP: ffff9ee947133b80 R08: 0000000000000008 R09: ffff9ee947133b68
[  130.074792] R10: 0000000000000000 R11: ffff8c4b9360991a R12: ffff8c4bdc352900
[  130.074793] R13: ffff8c4b88239800 R14: 0000000000000000 R15: ffff8c4b88239800
[  130.074795] FS:  00007f4e70044c80(0000) GS:ffff8c5c97680000(0000) knlGS:0000000000000000
[  130.074796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  130.074798] CR2: 00007f4e7074c3ea CR3: 000000019b78a004 CR4: 0000000000770ee0
[  130.074800] PKRU: 55555554
[  130.074801] Call Trace:
[  130.074802]  drm_new_set_master+0x7e/0x100 [drm]
[  130.074822]  drm_master_open+0x6e/0xa0 [drm]
[  130.074842]  drm_open+0xf8/0x250 [drm]
[  130.074863]  drm_stub_open+0xba/0x140 [drm]
[  130.074887]  chrdev_open+0xf7/0x220
[  130.074891]  ? cdev_device_add+0x90/0x90
[  130.074894]  do_dentry_open+0x156/0x370
[  130.074899]  vfs_open+0x2d/0x30
[  130.074904]  do_open+0x1c3/0x340
[  130.074907]  path_openat+0x10a/0x1d0
[  130.074910]  ? psi_group_change+0x42/0x220
[  130.074913]  do_filp_open+0x8c/0x130
[  130.074917]  ? __check_object_size+0x1c/0x20
[  130.074920]  do_sys_openat2+0x9b/0x150
[  130.074925]  __x64_sys_openat+0x56/0x90
[  130.074929]  do_syscall_64+0x38/0x90
[  130.074931]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  130.074935] RIP: 0033:0x7f4e704de8db
[  130.074937] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
[  130.074939] RSP: 002b:00007ffd8b6e8f10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  130.074942] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4e704de8db
[  130.074944] RDX: 0000000000000002 RSI: 00007ffd8b6e8fe0 RDI: 00000000ffffff9c
[  130.074945] RBP: 00007ffd8b6e8fe0 R08: 0000000000000000 R09: 00007ffd8b6e8e20
[  130.074946] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[  130.074948] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  130.074950] ---[ end trace 891fb0926d35dabc ]---

I can think of two possible options that may resolve this issue:

  • Enabling NVIDIA KMS:
echo options nvidia_drm modeset=1 | sudo tee -a /etc/modprobe.d/nvidia-kms.conf
  • Generating a new xorg.conf with the NVIDIA X Server utility while both displays are plugged in and working. Maybe fiddle with the HardDPMS option in xorg.conf and see if anything changes. (For more info, search for “HardDPMS”.)

Reboots are required after each change.
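
Note that `tee -a` appends the line again every time the command is run. A small sketch of an idempotent alternative, plus a check of whether KMS actually took effect after the reboot, is shown below. Both function names are made up for illustration, and the sysfs path assumes the nvidia_drm module is loaded.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
#   $1 = exact line, $2 = file
ensure_line() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: report whether nvidia_drm runs with modeset enabled.
# The default path is an assumption about the running system.
modeset_state() {
    param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param_file" ]; then
        echo "nvidia_drm not loaded"
        return 0
    fi
    case "$(cat "$param_file")" in
        Y|1) echo "enabled" ;;
        *)   echo "disabled" ;;
    esac
}
```

From a root shell, `ensure_line "options nvidia_drm modeset=1" /etc/modprobe.d/nvidia-kms.conf` writes the option once, and `modeset_state` after the reboot shows whether the module picked it up.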

Stack trace in dmesg when options nvidia_drm modeset=1 is configured:

[  195.862426] Call Trace:
[  195.862427]  drm_new_set_master+0x7e/0x100 [drm]
[  195.862449]  drm_master_open+0x6e/0xa0 [drm]
[  195.862471]  drm_open+0xf8/0x250 [drm]
[  195.862494]  drm_stub_open+0xba/0x140 [drm]
[  195.862520]  chrdev_open+0xf7/0x220
[  195.862524]  ? cdev_device_add+0x90/0x90
[  195.862527]  do_dentry_open+0x156/0x370
[  195.862531]  vfs_open+0x2d/0x30
[  195.862535]  do_open+0x1c3/0x340
[  195.862538]  path_openat+0x10a/0x1d0
[  195.862541]  ? psi_group_change+0x42/0x220
[  195.862544]  do_filp_open+0x8c/0x130
[  195.862549]  ? __check_object_size+0x1c/0x20
[  195.862552]  do_sys_openat2+0x9b/0x150
[  195.862556]  __x64_sys_openat+0x56/0x90
[  195.862560]  do_syscall_64+0x38/0x90
[  195.862563]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  195.862567] RIP: 0033:0x7ff62edd08db
[  195.862569] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
[  195.862571] RSP: 002b:00007ffd86c25810 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  195.862573] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff62edd08db
[  195.862575] RDX: 0000000000000002 RSI: 00007ffd86c258e0 RDI: 00000000ffffff9c
[  195.862576] RBP: 00007ffd86c258e0 R08: 0000000000000000 R09: 00007ffd86c25720
[  195.862578] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[  195.862579] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  195.862582] ---[ end trace af495a49ff84593d ]---
What's the default and/or recommended for this option? I didn't check before making the change.

I’m out of ideas! Maybe creating a new topic will get you more help.

@humblebee
Please do spin up a new topic, since this one has been tagged as being solved (you solved it :) )

Please repeat any details you feel may be relevant.
I have some more volunteers to help on base Linux issues - so hopefully we can help you faster this time around, but a new topic helps a ton - thanks!

@humblebee What does systemctl start nvidia-<foo>.service mean? Does it put the system into the desired state immediately? All the nvidia services on my Ubuntu 21.04 are in “enabled” and “inactive (dead)” status, except nvidia-persistence.service, which is “active (running)”. Running systemctl start nvidia-suspend.service hangs my system.

Those systemd units are not intended to be started manually, but rather as part of the systemd-suspend.service’s life cycle, which is used to do the low-level work of the systemctl suspend command.

The nvidia-suspend.service unit signals to the NVIDIA driver that it should suspend application access to the GPU, evict the contents of the GPU’s video memory, and get the GPU ready for system suspend. If the system doesn’t actually suspend (e.g. because you started that service manually rather than relying on systemctl suspend to do it for you) then it just wedges anything that tries to access the GPU until you manually start the corresponding nvidia-resume.service.
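
As a rough sketch of the recovery step described above, assuming you can still reach a shell (for example over SSH or on a text console): the `run` and `unwedge_gpu` names and the `DRY_RUN` variable are made up for illustration, and the only real command involved is `systemctl start nvidia-resume.service`.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Undo a manually started nvidia-suspend.service by starting its counterpart.
# Needs root when actually executed.
unwedge_gpu() {
    run systemctl start nvidia-resume.service
}
```

Setting `DRY_RUN=1` before calling `unwedge_gpu` only prints what would be executed. For a normal suspend none of this is needed; `systemctl suspend` triggers the units in the right order.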

This topic was automatically closed after 3 days. New replies are no longer allowed.