but I cannot tell whether they are related, either in general or in time, because the journalctl output is limited to nvidia-related messages only. You can check yourself with journalctl -b X, where X is 0 for the current boot, -1 for the previous boot, and so on.
It seems Chrome triggers this. Can you check whether it still happens with GPU acceleration turned off in Chrome?
Edit 2:
There should also be a newer driver version available from Ubuntu or from the graphics-drivers PPA. Please try that as well.
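To make the journal check above easier to repeat, here is a minimal sketch that keeps only the NVIDIA Xid lines from kernel log text. The helper name xid_events is made up for illustration; the journalctl options shown in the comment (-k for kernel messages, -b for boot selection) are standard.

```shell
#!/usr/bin/env bash
# Sketch: keep only NVIDIA Xid lines from kernel log text read on stdin.
# xid_events is a hypothetical helper name, not part of any NVIDIA tool.
xid_events() {
    # "|| true" keeps the exit status at 0 when no Xid line is present.
    grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): [0-9]+' || true
}

# Typical use: kernel messages (-k) of the previous boot (-b -1).
#   journalctl -k -b -1 --no-pager | xid_events
```

Dropping the filter and reading the unfiltered journal around the reported timestamps shows what else happened at the same moment.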
I wouldn’t blame Chrome, as this has happened before and that time it was the Telegram Desktop app. So I don’t really understand this suggestion: what should I do, not use the GPU at all after resuming from suspend? Are you kidding me?
I’ve checked journalctl -b-1, and I can see that the problem you mention (Xid 31) happens at the moment the system is resuming from sleep. I suppose the time is off because the clock has not been synced yet at that moment.
Mar 31 11:45:05 gingerblade kernel: CPU7 is up
Mar 31 11:45:05 gingerblade kernel: smpboot: Booting Node 0 Processor 8 APIC 0x5
Mar 31 11:45:05 gingerblade kernel: CPU8 is up
Mar 31 11:45:05 gingerblade kernel: smpboot: Booting Node 0 Processor 9 APIC 0x7
Mar 31 11:45:05 gingerblade kernel: CPU9 is up
Mar 31 11:45:05 gingerblade kernel: smpboot: Booting Node 0 Processor 10 APIC 0x9
Mar 31 11:45:05 gingerblade kernel: CPU10 is up
Mar 31 11:45:05 gingerblade kernel: smpboot: Booting Node 0 Processor 11 APIC 0xb
Mar 31 11:45:05 gingerblade kernel: CPU11 is up
Mar 31 11:45:05 gingerblade kernel: ACPI: Waking up from system sleep state S4
Mar 31 11:45:05 gingerblade kernel: ACPI Error: No handler for Region [VRTC] (000000008b578016) [SystemCMOS] (20200528/evregion-127)
Mar 31 11:45:05 gingerblade kernel: ACPI Error: Region SystemCMOS (ID=5) has no handler (20200528/exfldio-261)
Mar 31 11:45:05 gingerblade kernel: No Local Variables are initialized for Method [RTEC]
Mar 31 11:45:05 gingerblade kernel: No Arguments are initialized for method [RTEC]
Mar 31 11:45:05 gingerblade kernel: ACPI Error: Aborting method \_SB.PCI0.LPCB.EC0.RTEC due to previous error (AE_NOT_EXIST) (20200528/psparse-529)
Mar 31 11:45:05 gingerblade kernel: ACPI Error: Aborting method \RWAK due to previous error (AE_NOT_EXIST) (20200528/psparse-529)
Mar 31 11:45:05 gingerblade kernel: ACPI Error: Aborting method \_WAK due to previous error (AE_NOT_EXIST) (20200528/psparse-529)
Mar 31 11:45:05 gingerblade kernel: ACPI Error: AE_NOT_EXIST, While executing method \_WAK (20200528/hwesleep-47)
Mar 31 11:45:05 gingerblade kernel: ACPI: EC: interrupt unblocked
Mar 31 11:45:05 gingerblade kernel: pcieport 0000:00:1b.4: Intel SPT PCH root port ACS workaround enabled
Mar 31 11:45:05 gingerblade kernel: pcieport 0000:00:1b.0: Intel SPT PCH root port ACS workaround enabled
Mar 31 11:45:05 gingerblade kernel: pcieport 0000:00:1d.0: Intel SPT PCH root port ACS workaround enabled
Mar 31 11:45:05 gingerblade kernel: usb usb1: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: usb usb2: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: ACPI: EC: event unblocked
Mar 31 11:45:05 gingerblade kernel: usb usb5: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: usb usb6: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: nvme nvme0: 12/0/0 default/read/poll queues
Mar 31 11:45:05 gingerblade kernel: NVRM: GPU at PCI:0000:01:00: GPU-528d1d55-f343-fdc0-b25f-b5a49a4eac61
Mar 31 11:45:05 gingerblade kernel: NVRM: GPU Board Serial Number:
Mar 31 11:45:05 gingerblade kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=226, Ch 00000001, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x75_b8f74000. Fault is of type FAULT_PDE ACCESS_TYPE_VI>
Mar 31 11:45:05 gingerblade kernel: usb 1-8: reset full-speed USB device number 3 using xhci_hcd
Mar 31 11:45:05 gingerblade kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=226, Ch 00000000, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0xff_c0822000. Fault is of type FAULT_PDE ACCESS_TYPE_VI>
Mar 31 11:45:05 gingerblade kernel: usb usb3: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: usb usb4: root hub lost power or was reset
Mar 31 11:45:05 gingerblade kernel: usb 1-7: reset high-speed USB device number 2 using xhci_hcd
Mar 31 11:45:05 gingerblade kernel: restoring control 00000000-0000-0000-0000-000000000101/10/5
Mar 31 11:45:05 gingerblade kernel: restoring control 00000000-0000-0000-0000-000000000101/12/11
Mar 31 11:45:05 gingerblade kernel: usb 1-14: reset full-speed USB device number 4 using xhci_hcd
Mar 31 11:45:05 gingerblade kernel: acpi LNXPOWER:06: Turning OFF
Mar 31 11:45:05 gingerblade kernel: acpi LNXPOWER:02: Turning OFF
Mar 31 11:45:05 gingerblade kernel: PM: hibernation: Basic memory bitmaps freed
Mar 31 11:45:05 gingerblade kernel: OOM killer enabled.
Mar 31 11:45:05 gingerblade kernel: Restarting tasks ... done.
Mar 31 11:45:05 gingerblade kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
Mar 31 11:45:05 gingerblade kernel: thermal thermal_zone8: failed to read out thermal zone (-61)
Mar 31 11:45:05 gingerblade kernel: PM: hibernation: hibernation exit
Mar 31 11:45:05 gingerblade kernel: Bluetooth: hci0: Firmware revision 0.0 build 100 week 47 2019
Mar 31 11:45:04 gingerblade systemd-sleep[603034]: System resumed.
I can’t see any drivers available that are more recent than the ones I have. Could you please point me to the correct repo?
$ dpkg -l | grep -E "\bnvidia-"
ii nvidia-compute-utils-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-cuda-dev 10.1.243-3 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 10.1.243-3 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 10.1.243-3 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 10.1.243-3 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-460 460.39-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
ii nvidia-opencl-dev:amd64 10.1.243-3 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.16~0.20.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 10.1.243-3 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 460.39-0ubuntu0.20.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA driver support binaries
ii nvidia-visual-profiler 10.1.243-3 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-460 460.39-0ubuntu0.20.04.1 amd64 NVIDIA binary Xorg driver
As far as I can see, this should only matter when a lot of GPU RAM is wired/allocated at the time of the hibernate/suspend event, right? I have 8GB of GPU RAM and normally only 1 GB is used by Xorg, Chrome and some other apps.
Do you think I should try configuring the second option, using /proc/driver/nvidia/suspend?
Option one, which is the default (the one you should be using right now), is limited in functionality.
As I said, I’d give it a try. It won’t hurt and is easily reverted. The only thing to watch out for is having enough space on the drive/mountpoint.
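As a rough way to check the space requirement, the sketch below compares the free space on a mount point with the GPU's memory size. The helper name has_room is hypothetical, and comparing against the total video memory is a conservative assumption on my part (the driver only needs room for what is actually allocated).

```shell
#!/usr/bin/env bash
# has_room AVAIL_KIB NEED_MIB
# Succeeds if AVAIL_KIB kibibytes are enough to hold NEED_MIB mebibytes.
# has_room is a hypothetical helper name used only for this sketch.
has_room() {
    local avail_kib=$1 need_mib=$2
    [ "$avail_kib" -ge $(( need_mib * 1024 )) ]
}

# Typical use (GNU df; nvidia-smi reports memory.total in MiB):
#   avail=$(df --output=avail -k /tmp | tail -n 1 | tr -d ' ')
#   total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
#   has_room "$avail" "$total" && echo "enough space" || echo "too small"
```

With the numbers mentioned in this thread (an 8GB card and a 10485760k tmpfs), the check would pass as long as the tmpfs is mostly empty.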
OK, so I’ve updated to the most recent driver from the PPA. I’ll try configuring option #2 and hibernating later today; if it crashes again I’ll provide another nvidia-bug-report.log.gz. Can I expect it to be analyzed by Nvidia after that?
There were some improvements I made in the 465.19.01 beta for suspend/resume with the power management stuff enabled. I know it might be difficult to test the beta if you’re using a PPA but would it be possible to give it a try?
Ah, I read the changelog, but I didn’t see anything in it about suspend/resume improvements, except for the automatic installation.
For xyapus:
If you want to give it a try, make sure you purge all the nvidia driver packages from the PPA (apt purge nvidia* libnvidia*) before installing via the .run file. Also stop the X server before installation (i.e. systemctl isolate multi-user.target).
There was a lot of intertwined behavior around VT switches and suspend/resume that I tried to untangle for the 465 series. All of it hinges on the NVreg_PreserveVideoMemoryAllocations=1 module parameter, which is still disabled by default in most cases. The suspend/hibernate/resume systemd units are required for the video memory preservation to function, which is why I made an effort to make the installer set those up automatically. If you’re using a PPA or other distribution packages, you’ll need to check with them to determine whether those systemd services are installed or enabled by default.
So the current state of things in 465.19.01 is that if you use the .run installer on a systemd distro, the only thing you’re supposed to need to do manually is enable NVreg_PreserveVideoMemoryAllocations=1.
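For reference, the module parameter is usually set through a modprobe options line. The sketch below only prints such a line; the helper name nvidia_pm_options and the file name in the usage comment are assumptions for illustration, and the parameter is spelled with its full name, NVreg_PreserveVideoMemoryAllocations, as it appears later in this thread.

```shell
#!/usr/bin/env bash
# nvidia_pm_options TMP_PATH
# Prints a modprobe options line that enables video memory preservation and
# points the driver at TMP_PATH for its temporary file.
# nvidia_pm_options is a hypothetical helper name used only for this sketch.
nvidia_pm_options() {
    local tmp_path=$1
    printf 'options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=%s\n' "$tmp_path"
}

# Typical use (the .conf file name is an assumption; pick any name you like):
#   nvidia_pm_options /var/tmp | sudo tee /etc/modprobe.d/nvidia-power-management.conf
#   sudo systemctl enable nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service
# After a reboot, check the value the loaded module actually uses:
#   grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params
```

If the nvidia module is loaded from the initramfs, the initramfs probably needs to be regenerated before the new option takes effect.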
While using 460.67-0ubuntu0~0.20.04.1, I tried manually following the Configuring Power Management Support guide and installed the required systemd services. I’ve set up /tmp as a tmpfs of the proper size using /etc/systemd/system/tmp.mount, so:
$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=10485760k)
I cannot resume from hibernation when NVreg_PreserveVideoMemoryAllocations=1 is set.
I’ve also tried other TemporaryFilePath locations, as the doc states that
To achieve the best performance, file system types other than tmpfs are recommended at this time.
So I changed to NVreg_TemporaryFilePath=/tmp.nvidia and created the directory /tmp.nvidia, but I still can’t resume from hibernation, with the same error as above.
@aplattner from the changelog it’s not clear to me whether anything related to my problem has changed in the driver itself between v460 and v465. I can see that the installation of the systemd units is now automated, but I’ve managed to do that manually, so do I still have to go with the beta? Honestly, I’m not too comfortable with betas…