Ubuntu 20.04 with nvidia-460 driver freezes randomly after resume from suspend/hibernate

There were some improvements I made in the 465.19.01 beta for suspend/resume with the power management stuff enabled. I know it might be difficult to test the beta if you’re using a PPA but would it be possible to give it a try?

Ah, I read the changelog, but I didn’t spot anything about suspend/resume improvements apart from the automatic installation.

For xyapus:
If you want to give it a try, make sure you purge all nvidia-driver PPA packages (apt purge nvidia* libnvidia*) before installing via the .run file. And stop the X server before installation (i.e. systemctl isolate multi-user.target).

There was a lot of intertwined behavior around VT switches and suspend/resume that I tried to untangle for the 465 series. All of it hinges on the NVreg_PreserveVideoMemoryAllocations=1 module parameter, which is still disabled by default in most cases. The suspend/hibernate/resume systemd units are required for the video memory preservation to function, which is why I made an effort to make the installer set those up automatically. If you’re using a PPA or other distribution packages, you’ll need to check with them to determine whether those systemd services are installed or enabled by default.

So the current state of things in 465.19.01 is that if you use the .run installer on a systemd distro, the only thing you’re supposed to need to do manually is enable NVreg_PreserveVideoMemoryAllocations=1.
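One way to confirm that the parameter actually took effect after a reboot is to look at the driver's procfs parameter dump. The sketch below assumes /proc/driver/nvidia/params lists one `Name: value` pair per line; the sample text and the helper names are made up for illustration and were not captured from anyone's system in this thread.

```python
from pathlib import Path


def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict, stripping surrounding quotes."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            params[name.strip()] = value.strip().strip('"')
    return params


def read_live_params(path="/proc/driver/nvidia/params"):
    """Read the parameters of the currently loaded nvidia module (needs the driver)."""
    return parse_nvidia_params(Path(path).read_text())


# Illustrative sample only (assumed format, hypothetical values).
SAMPLE = 'PreserveVideoMemoryAllocations: 1\nTemporaryFilePath: "/tmp.nvidia"\n'

params = parse_nvidia_params(SAMPLE)
preserve_enabled = params.get("PreserveVideoMemoryAllocations") == "1"
print("video memory preservation enabled:", preserve_enabled)
print("temporary file path:", params.get("TemporaryFilePath"))
```

On a machine with the driver loaded, calling read_live_params() instead of parsing SAMPLE should show whether the value set through the module options reached the kernel module.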

Thank you for the explanation @aplattner !

@xyapus you can find that driver and its release notes here, for example: Current graphics driver releases

While using 460.67-0ubuntu0~0.20.04.1, I tried manually following the Configuring Power Management Support guide and installed the required systemd services. I’ve set up /tmp as a tmpfs of sufficient size using /etc/systemd/system/tmp.mount:

$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=10485760k)

I cannot resume from hibernate when NVreg_PreserveVideoMemoryAllocations=1 is set:

Apr 01 08:38:39 gingerblade kernel: PM: hibernation: Read 5139808 kbytes in 4.57 seconds (1124.68 MB/s)
Apr 01 08:38:39 gingerblade kernel: PM: Image successfully loaded
Apr 01 08:38:39 gingerblade kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Apr 01 08:38:39 gingerblade kernel: NVRM: GPU 0000:01:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the 'Configuring Power Management Support' section in the driver README.
Apr 01 08:38:39 gingerblade kernel: PM: pci_pm_freeze(): nv_pmops_freeze+0x0/0x20 [nvidia] returns -5
Apr 01 08:38:39 gingerblade kernel: PM: dpm_run_callback(): pci_pm_freeze+0x0/0xc0 returns -5
Apr 01 08:38:39 gingerblade kernel: PM: Device 0000:01:00.0 failed to quiesce async: error -5
Apr 01 08:38:39 gingerblade kernel: PM: hibernation: Failed to load image, recovering.
Apr 01 08:38:39 gingerblade kernel: PM: hibernation: Basic memory bitmaps freed
Apr 01 08:38:39 gingerblade kernel: PM: hibernation: resume failed (-5)

Am I missing something?
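For reading logs like the one above, it can help to translate the negative return values into errno names. This is a minimal sketch using Python's standard errno table; the regular expression is an assumption that only covers the "returns -N", "error -N" and "code -N" forms seen in this thread, and the log lines are copied from the posts here.

```python
import errno
import os
import re

# Lines taken from the kernel logs quoted in this thread.
LOG = """\
PM: pci_pm_freeze(): nv_pmops_freeze+0x0/0x20 [nvidia] returns -5
PM: Device 0000:01:00.0 failed to quiesce async: error -5
PM: Image not found (code -22)
"""


def decode_kernel_errors(log_text):
    """Return (code, errno name, description) for each distinct error code found."""
    found = re.findall(r"(?:returns|error|code) -(\d+)", log_text)
    codes = sorted({int(value) for value in found})
    return [(c, errno.errorcode.get(c, "unknown"), os.strerror(c)) for c in codes]


for code, name, description in decode_kernel_errors(LOG):
    print(f"-{code}: {name} ({description})")
```

Here -5 maps to EIO and -22 to EINVAL. The names only say what kind of failure was reported; they do not by themselves explain why the driver refused the freeze callback.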

I’ve also tried other TemporaryFilePath locations, as the doc states:

To achieve the best performance, file system types other than tmpfs are recommended at this time.

So I changed to NVreg_TemporaryFilePath=/tmp.nvidia and created the directory /tmp.nvidia, but I still can’t resume from hibernation; the error is the same as above.

@aplattner from the changelog it’s not clear to me whether anything regarding my problem has changed in the driver itself between v460 and v465. I can see that the installation of the systemd units is now automated, but I’ve managed to do that manually, so do I still have to go with the beta? Honestly, I’m not too comfortable with betas…

tmpfs is a temporary filesystem that resides in memory and/or swap partition(s). Mounting directories as tmpfs can be an effective way of speeding up accesses to their files, or to ensure that their contents are automatically cleared upon reboot.

Having your resume file cleared upon reboot sounds like a no-go, as the file would be gone.

Can you please do systemctl status nvidia-suspend nvidia-hibernate nvidia-resume to verify that the systemd services are actually enabled? Also, how did you trigger the suspend? You need to use systemctl suspend or systemctl hibernate rather than writing to /sys/power/state directly.

Data in tmpfs is included in the hibernation image that the kernel writes to the disk, so that data should still be there during a resume from hibernation. I.e. it might be slower but it should at least still work as long as tmpfs has enough space to store the contents of video memory.
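The "enough space" condition can be checked with a little arithmetic. The sketch below parses the size option from the mount line posted earlier and compares it with an amount of video memory; the 8 GiB figure is a hypothetical placeholder, since the GPU's memory size is not stated in this thread.

```python
import re

# Multipliers for the size suffixes that mount prints for tmpfs.
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

# Mount line as posted earlier in this thread.
MOUNT_LINE = "tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=10485760k)"


def tmpfs_size_bytes(mount_line):
    """Return the configured tmpfs size limit in bytes."""
    match = re.search(r"\bsize=(\d+)([kmg]?)", mount_line)
    if match is None:
        raise ValueError("no size= option found in mount line")
    return int(match.group(1)) * UNITS[match.group(2)]


# Hypothetical value: replace with the real amount of video memory.
ASSUMED_VRAM_BYTES = 8 * 1024**3

limit = tmpfs_size_bytes(MOUNT_LINE)
print(f"tmpfs limit: {limit / 1024**3:.1f} GiB")
print("large enough for assumed VRAM:", limit >= ASSUMED_VRAM_BYTES)
```

Note that size= is only the upper limit of the mount. Files already stored in /tmp count against it, and the data itself still has to fit in RAM plus swap.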

$ systemctl status nvidia-suspend nvidia-hibernate nvidia-resume
● nvidia-suspend.service - NVIDIA system suspend actions
     Loaded: loaded (/etc/systemd/system/nvidia-suspend.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

● nvidia-hibernate.service - NVIDIA system hibernate actions
     Loaded: loaded (/etc/systemd/system/nvidia-hibernate.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

● nvidia-resume.service - NVIDIA system resume actions
     Loaded: loaded (/etc/systemd/system/nvidia-resume.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

And yes, I use sudo systemctl hibernate.

I tried both configurations, with and without tmpfs, with no luck.
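If the status output needs to be checked repeatedly, the enablement state can be pulled out of the text mechanically. This is a rough sketch that assumes the output layout shown above (a "●" header line followed by a "Loaded:" line); `systemctl is-enabled <unit>` is a simpler way to ask the same question for a single unit.

```python
import re

# Excerpt of the status output posted above.
STATUS = """\
● nvidia-suspend.service - NVIDIA system suspend actions
     Loaded: loaded (/etc/systemd/system/nvidia-suspend.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

● nvidia-hibernate.service - NVIDIA system hibernate actions
     Loaded: loaded (/etc/systemd/system/nvidia-hibernate.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
"""


def unit_enablement(status_text):
    """Map each unit name to True if its Loaded: line reports 'enabled'."""
    result = {}
    unit = None
    for line in status_text.splitlines():
        header = re.match(r"\s*●\s+(\S+)", line)
        if header:
            unit = header.group(1)
            continue
        loaded = re.match(r"\s*Loaded: loaded \([^;]+; (\w+)", line)
        if loaded and unit:
            result[unit] = loaded.group(1) == "enabled"
    return result


for name, enabled in unit_enablement(STATUS).items():
    print(name, "enabled" if enabled else "not enabled")
```

An enabled unit only means it is hooked into the sleep targets; whether it actually ran during a given hibernate attempt has to be checked in the journal.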

Ah OK, it’s a little weird to wrap your head around, but I’ll take it ;-)

That’s awfully strange – if those units are enabled then systemd-hibernate.service should have run them. Does systemctl status systemd-hibernate.service show anything about it running the nvidia ones?

Nope, should it?

$ sudo systemctl status systemd-hibernate.service
● systemd-hibernate.service - Hibernate
 Loaded: loaded (/lib/systemd/system/systemd-hibernate.service; static; vendor preset: enabled)
 Active: inactive (dead)
   Docs: man:systemd-suspend.service(8)

Yes, quite a bit changed in v465. There were some fixes for data corruption on some GPUs, and the interaction between the X server and OpenGL clients during VT switches (which happen during suspend too) was significantly simplified when NVreg_PreserveVideoMemoryAllocations=1 is enabled.

Yeah, if it actually started that service during hibernate then there should be messages about it in the journal. My system isn’t set up for hibernate but this is what I get for the similar suspend path:

> systemctl status systemd-suspend
● systemd-suspend.service - Suspend
     Loaded: loaded (/usr/lib/systemd/system/systemd-suspend.service; static)
     Active: inactive (dead)
       Docs: man:systemd-suspend.service(8)

Mar 30 23:08:49 aplattner systemd[1]: Starting Suspend...
Mar 30 23:08:49 aplattner systemd-sleep[2263539]: Suspending system...
Mar 31 07:43:38 aplattner systemd-sleep[2263539]: System resumed.
Mar 31 07:43:38 aplattner systemd[1]: systemd-suspend.service: Succeeded.
Mar 31 07:43:38 aplattner systemd[1]: Finished Suspend.
Mar 31 23:29:18 aplattner systemd[1]: Starting Suspend...
Mar 31 23:29:18 aplattner systemd-sleep[3774631]: Suspending system...
Apr 01 00:39:58 aplattner systemd-sleep[3774631]: System resumed.
Apr 01 00:39:58 aplattner systemd[1]: systemd-suspend.service: Succeeded.
Apr 01 00:39:58 aplattner systemd[1]: Finished Suspend.

I could give that beta a try later on, but at the moment I’m not sure my setup is configured correctly, so I can’t be sure that hibernate is really unusable with the latest PPA driver I have…

My concern about betas is that this is my main PC, and crippling it with beta drivers doesn’t sound like a good idea…

If I do systemctl suspend, I can also see log messages like the ones you have. These messages do not survive reboots, so I cannot fully verify whether they are there for systemctl hibernate: when I wake the system from hibernate, it fails to resume and reboots, and after the reboot the messages are gone.
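Whether messages from an earlier boot can be read back depends on journald storing its journal on disk. With the default Storage=auto setting, that happens only when /var/log/journal exists. The sketch below checks for that directory and builds the journalctl command for the previous boot's kernel messages; the helper name is made up for illustration, and messages written just before the hibernation image was created may still be missing.

```python
import os


def previous_boot_kernel_log_cmd(journal_dir="/var/log/journal"):
    """Return the journalctl command for the previous boot's kernel log.

    Returns None when the persistent journal directory does not exist,
    in which case older boots are not available.
    """
    if not os.path.isdir(journal_dir):
        return None
    return ["journalctl", "--boot=-1", "--dmesg", "--no-pager"]


cmd = previous_boot_kernel_log_cmd()
if cmd is None:
    print("no persistent journal found; older boots cannot be inspected")
else:
    print("run:", " ".join(cmd))
```

If the directory is missing, creating it and restarting systemd-journald is the usual way to turn persistence on.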

I can see that, sure.

It’s good to know that you’re at least seeing the log messages from suspend. Is suspend & resume working correctly for you and it’s just hibernate that’s not working?

If possible, it might be useful to run journalctl -f & from an SSH session before triggering hibernate. If there’s something going wrong during the hibernation phase then maybe you’d see it that way. If the problem is occurring during resume then it’s a little tougher – you might be able to enable verbose logging on the kernel command line somewhere to see if there are any errors on the console, but you won’t be able to see them on the SSH connection that way.

Edit: Oh, I guess the messages in your earlier comment are from the resume phase. It’s strange that the nvidia kernel module doesn’t think it was suspended with the procfs/systemd interface there.

OK, the forum blocked me from replying earlier. Thanks for the directions. I’ll try again later with the different options you mention and get back to you with the results.

Yes, the message log is from the resume stage. I’ll try the SSH session method to see if there are any errors during the hibernation phase.

So I tried different options but cannot get hibernation to work. Here’s the log of the hibernate phase; I can see that nvidia-hibernate.service is definitely being called and succeeds:

Apr 04 12:30:21 gingerblade systemd[1]: Reached target Sleep.
Apr 04 12:30:21 gingerblade systemd[1]: Starting NVIDIA system hibernate actions...
Apr 04 12:30:21 gingerblade hibernate[8060]: nvidia-hibernate.service
Apr 04 12:30:21 gingerblade logger[8060]: <13>Apr  4 12:30:21 hibernate: nvidia-hibernate.service
Apr 04 12:30:21 gingerblade systemd[1]: nvidia-hibernate.service: Succeeded.
Apr 04 12:30:21 gingerblade systemd[1]: Finished NVIDIA system hibernate actions.
Apr 04 12:30:21 gingerblade systemd[1]: Starting Hibernate...
Apr 04 12:30:21 gingerblade kernel: PM: Image not found (code -22)
Apr 04 12:30:21 gingerblade systemd-sleep[8071]: Suspending system...

Then in the resume stage I get the same error: the nvidia driver breaks the resume and the system does a fresh boot:

Apr 04 12:31:41 gingerblade kernel: PM: Image signature found, resuming
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: resume from hibernation
Apr 04 12:31:41 gingerblade kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Apr 04 12:31:41 gingerblade kernel: OOM killer disabled.
Apr 04 12:31:41 gingerblade kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x0005e000-0x0005efff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x000a0000-0x000fffff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x7c928000-0x7c928fff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x7c948000-0x7c948fff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x7cf48000-0x7cf61fff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x7f15b000-0x7f15bfff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x82a8e000-0x85c4dfff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Marking nosave pages: [mem 0x85c4f000-0xffffffff]
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Basic memory bitmaps created
Apr 04 12:31:41 gingerblade kernel: PM: Using 3 thread(s) for decompression
Apr 04 12:31:41 gingerblade kernel: PM: Loading and decompressing image data (1130381 pages)...
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:   0%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  10%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  20%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  30%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  40%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  50%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  60%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  70%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  80%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress:  90%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading progress: 100%
Apr 04 12:31:41 gingerblade kernel: PM: Image loading done
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Read 4521524 kbytes in 3.90 seconds (1159.36 MB/s)
Apr 04 12:31:41 gingerblade kernel: PM: Image successfully loaded
Apr 04 12:31:41 gingerblade kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Apr 04 12:31:41 gingerblade kernel: NVRM: GPU 0000:01:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the>
Apr 04 12:31:41 gingerblade kernel: PM: pci_pm_freeze(): nv_pmops_freeze+0x0/0x20 [nvidia] returns -5
Apr 04 12:31:41 gingerblade kernel: PM: dpm_run_callback(): pci_pm_freeze+0x0/0xc0 returns -5
Apr 04 12:31:41 gingerblade kernel: PM: Device 0000:01:00.0 failed to quiesce async: error -5
Apr 04 12:31:41 gingerblade kernel: nvme nvme0: 12/0/0 default/read/poll queues
Apr 04 12:31:41 gingerblade kernel: fbcon: Taking over console
Apr 04 12:31:41 gingerblade kernel: Console: switching to colour frame buffer device 240x67
Apr 04 12:31:41 gingerblade kernel: PM: hibernation: Failed to load image, recovering.

I tried with and without a tmpfs mount for /tmp; it makes no difference.

Is there anything else I can provide you with? Any logs that might be helpful?

Is this log still with the release 460 drivers? I haven’t had a chance to set my system back up for hibernate to try the 460 drivers yet to see if I can reproduce this, sorry.