[Regression 460 series] Black screen on boot: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer

I have the same issue with the 460.39 driver on Ubuntu 20.04/Kernel 5.8.0 on Dell Precision 7510 with Quadro M2000M.

Exactly the same happens on my Razer 15" (2018) without any external monitor, with an NVIDIA GeForce GTX 1060 Max-Q. It happened on Ubuntu 20.04 and still happens on Ubuntu 20.10. Neither the 450 nor the 460 branch works properly. Only the legacy version 390 works correctly.

This started happening for me (black screen when resuming after suspend) after an automatic update to 460.32.03 from 455.38 (see history.log file). I’m on an Acer Aspire 7 with NVIDIA GeForce GTX 1050, with Ubuntu 20.04.2 LTS.

I’ve been circumventing the issue by keeping my PRIME Profile on ‘Intel (Power Saving Mode)’ unless I’m using more graphics-intensive software.

12-01-12_history.log (12.8 KB) nvidia-bug-report.log.gz (403.8 KB)

Same issue on a Lenovo IdeaPad L340 with NVIDIA GeForce GTX 1050, Ubuntu 20.04.2 LTS. Only happens when coming back from sleep mode.

Happens to me as well on a Lenovo Legion Y720 with Ubuntu 20.04 and driver 460. If I have a second monitor attached via the HDMI port, the screen doesn’t stay blank. In fact, if I then remove the second monitor, suspend keeps working until the laptop is rebooted without the second monitor. So if I want suspend to work I just need to connect a second monitor, then remove it. Switching to my Intel card using prime-select also works.

I’ve rolled back to the 450 drivers, but came back here to see whether Nvidia has found or developed a fix.
The release notes of the new 465.24.02 driver discuss various regressions that are fixed. It is unclear, however, whether our problem is in that list. I don’t think it is, but there is a mention of a regression related to suspend behaviour. Is it worth testing these drivers? Or is it better to wait longer for a new production branch driver?

I’ve tested the 465 drivers, but they are susceptible to the same issue. :-(
Sadly, after rolling back to the 450 drivers I have run into another issue: Ubuntu keeps pushing me to update. How do I block these updates?

bram@bram-Zbook:~$ sudo apt list --upgradable
Listing... Done
libnvidia-cfg1-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-common-450/focal,focal 460.73.01-0ubuntu0.20.04.2 all [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-compute-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-compute-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-decode-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-decode-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-encode-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-encode-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-extra-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-fbc1-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-fbc1-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-gl-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-gl-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-ifr1-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
libnvidia-ifr1-450/focal 460.73.01-0ubuntu0.20.04.2 i386 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-compute-utils-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-dkms-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-driver-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-kernel-common-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-kernel-source-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
nvidia-utils-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
xserver-xorg-video-nvidia-450/focal 460.73.01-0ubuntu0.20.04.2 amd64 [upgradable from: 450.119.03-0ubuntu0.20.04.1]
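
One possible way to block these upgrades is to put the listed packages on hold with apt-mark. The following is only a minimal sketch: the helper name upgradable_nvidia_pkgs is hypothetical, it prints bare package names (the i386 variants may additionally need a :i386 suffix), and it has not been verified that a hold is enough to keep the 450 branch installed in this situation.

```shell
#!/bin/sh
# Sketch: extract the NVIDIA package names from `apt list --upgradable`
# output so that they can be passed to `apt-mark hold`.

# Reads `apt list --upgradable` output on stdin and prints the unique names
# of the NVIDIA packages (the part before the first "/").
upgradable_nvidia_pkgs() {
    grep -E '^(lib)?nvidia-|^xserver-xorg-video-nvidia-' \
        | cut -d/ -f1 \
        | sort -u
}

# Usage (not run automatically, requires root):
#   apt list --upgradable 2>/dev/null | upgradable_nvidia_pkgs | xargs sudo apt-mark hold
#   apt-mark showhold                 # list the packages currently on hold
#   sudo apt-mark unhold <package>    # release a hold again
```

Held packages are skipped by later upgrades until they are released with apt-mark unhold.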

For those people running into this only after suspend/resume: please check whether setting up the power management support described here:
https://download.nvidia.com/XFree86/Linux-x86_64/460.67/README/powermanagement.html
prevents the freeze on resume.

I followed that guide but it didn’t aid me in any way. Still a hard-freeze after resume is done.

I also tried the beta driver which sets up most of the stuff in the guide directly but that behaved even weirder with my card.

But I’m a bit curious, because I don’t think I experienced this issue when running Fedora 34 on the same machine. Then I switched to Ubuntu 20.04 and the issue came back… :/

I’ll try this out next week, but I’m not entirely sure why this should help. From what I understand the proposed fix relates to memory issues on suspend. There are a couple of things that I don’t get, specifically:
The GPU state saved by the NVIDIA kernel drivers includes allocations made in video memory. However, these allocations are collectively large, and typically cannot be evicted. Since the amount of system memory available to drivers at suspend time is often insufficient to accommodate large portions of video memory, the NVIDIA kernel drivers are designed to act conservatively, and normally only save essential video memory allocations.

  • My graphics card has only 4043 MiB of memory, which is not much, whereas the fix discusses cases where the video memory is too large to be saved before hibernation.
  • The issue also appears when using Nvidia Prime On-Demand, where NVIDIA X Server Settings reports a graphics memory utilization of only 8 MB. There is certainly enough space to store this during hibernation.
  • Why did this issue not appear with previous drivers, if it is caused by too little space being allocated for storing video memory during hibernation?

My Quadro M1000M does not support S0ix-based power management. Hence I’ll try the /proc/driver/nvidia/suspend approach with ‘Save allocations in an unnamed temporary file’. But I’m having issues there as well: the guide refers to services such as nvidia-suspend.service, which I cannot find. They are not in the /etc/systemd/system/ folder, and I cannot find the folder /usr/share/doc/NVIDIA_GLX-1.0/samples either.
If I need these services, where can I obtain them?
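
A quick way to look for these unit files is to search the usual systemd directories. This is a minimal sketch: the helper name find_nvidia_units is hypothetical, and the directory list in the usage comment is an assumption about where a distribution package might install the units (a later post in this thread shows nvidia-suspend.service loaded from /lib/systemd/system on Ubuntu).

```shell
#!/bin/sh
# find_nvidia_units DIR...
# Prints every nvidia-*.service unit file found directly inside the given
# directories. Directories that do not exist are skipped.
find_nvidia_units() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        find -H "$dir" -maxdepth 1 -name 'nvidia-*.service' -print
    done | sort
}

# Usage:
#   find_nvidia_units /etc/systemd/system /lib/systemd/system /usr/lib/systemd/system
#   systemctl list-unit-files 'nvidia-*'    # ask systemd directly
```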

I also saw this same problem,
GeForce GTX 970M
Linux Mint / Ubuntu Focal
I Timeshift-restored back to 450.119.03-0ubuntu0.20.04.1 and it suspends/resumes fine again.
The 460 version did not.
I did not roll forward to 465 as it didn’t seem to solve anyone else’s problems.


Same problem here, it happens whenever it suspends…

Tested on:
Ubuntu 20.04.2 LTS
NVIDIA-SMI 460.80
GeForce GTX 1060
Avell G1544 Fox

Hi there, I’ve tried the solution proposed by @generix, but (I think) to no avail. Nevertheless, I’ve obtained some new data. Could anybody help make sense of it?

In my case only the systemd (/proc/driver/nvidia/suspend) approach with an unnamed temporary file applies. The S0ix-based approach is not supported by my system.

First, I tried to reconfirm the current behaviour of my system, without applying any changes. I’m running nvidia-driver-465, which I have just installed.
I found nvidia-suspend.service, nvidia-hibernate.service, nvidia-resume.service, nvidia and nvidia-sleep.sh at their correct locations already.
Running sudo service nvidia-suspend status indicated that the service was loaded, but inactive:

sudo service nvidia-suspend status
● nvidia-suspend.service - NVIDIA system suspend actions
     Loaded: loaded (/lib/systemd/system/nvidia-suspend.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

The issue with suspending (sudo systemctl suspend) was as expected: the system hung during the wake procedure.

The issue with hibernation (sudo systemctl hibernate) is something that I haven’t seen described before. During hibernation something goes wrong, and the system ends up starting as if it were a normal boot.
During hibernation some output appears on my screen. I took a photo, but never managed to capture it properly. It says something along the lines of:

[  226.099407] snd_hda_intel 000:01:00.1: can't change power state from D3hot to D0 (config space inaccessible)
[  227.610591] tpm tpm0: tpm_try_transmit: send(): error -5

I can find the first error in the system log, which is attached, but only after the --reboot-- marker, while I already see the message during the hibernation process.
In journal.txt you can see that nvidia-hibernate.service was called, lines 115-117.

Is it reasonable to assume that the ‘hibernation becomes normal boot’ issue is related to the suspend issue discussed in this thread?

Planned for today: enable the NVreg_PreserveVideoMemoryAllocations=1 kernel module parameter for nvidia.ko. I verified with cat /proc/driver/nvidia/params that it is not yet set.
Similarly, I saw that TemporaryFilePath is empty. I’ll point it to /run for now, as proposed.
I’ll send an update when I’ve managed to set these properties.

NOTE: I also went over the Nvidia README files. Chapter 9, Known Issues, sent me to Chapter 16, Configuring a Notebook, which discusses suspend and hibernation issues. It describes three possible causes of the suspend/resume behaviour.

  1. Issues with PCI Express bus clocks: I tried their solution of keeping an OpenGL application running. This failed.
  2. System memory issues: the suspend image size could be an issue, but the proposed command sudo echo 0 > /sys/power/image_size returns ‘access-denied’. This assumption ties into the idea that the video memory handling is the issue. That is why I’m also looking at, among others, Chapter 21, Configuring Power Management Support, and why I’ve ordered extra system memory. (I can’t wait for that to arrive.)
  3. vbetool is not an issue: it is not, and has never been, installed on my system.

It seems that Nvidia is aware of the issues that we experience.
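
A note on the ‘access-denied’ result in item 2: in sudo echo 0 > /sys/power/image_size the redirection is performed by the calling, unprivileged shell, so only echo itself runs as root and opening the file fails. A common workaround is to pipe the value into sudo tee. The sketch below assumes nothing beyond that. The helper name write_as_root is hypothetical, and whether an image_size of 0 actually helps with this bug is untested.

```shell
#!/bin/sh
# write_as_root VALUE FILE
# Writes VALUE to FILE. If FILE is not writable by the current user, the write
# goes through `sudo tee`, so that the file is opened with root privileges.
write_as_root() {
    if [ -w "$2" ]; then
        printf '%s\n' "$1" > "$2"
    else
        printf '%s\n' "$1" | sudo tee "$2" > /dev/null
    fi
}

# Usage:
#   write_as_root 0 /sys/power/image_size
#   cat /sys/power/image_size
```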

journal.txt (647.9 KB) nvidia-bug-report.log.gz (458.5 KB)

I’ve enabled NVreg_PreserveVideoMemoryAllocations, set TemporaryFilePath and tested the suspend and hibernate behaviour, to no avail.

Exact steps: I tried to follow this guide on the Arch forum.

  1. Created a new folder in root where the files for the suspend can be stored:
sudo mkdir /tmp-nvidia
  2. Created a file with the nvidia kernel module options, containing the options line below:
sudo nano /etc/modprobe.d/nvidia-power-management.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/tmp-nvidia
  3. Rebooted.
  4. Checked the settings with:
cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
RegisterForACPIEvents: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableGpuFirmware: 2
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: "/tmp-nvidia"
ExcludedGpus: ""

The options are recognized.
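
The same check can be scripted instead of reading the whole list by eye. This is a minimal sketch, assuming the params file keeps the "Name: value" layout shown above. The helper names nv_param and power_mgmt_configured are hypothetical.

```shell
#!/bin/sh
# nv_param NAME [FILE]
# Prints the value of NAME from the NVIDIA params file, with surrounding
# double quotes removed. FILE defaults to /proc/driver/nvidia/params.
nv_param() {
    sed -n "s/^$1: //p" "${2:-/proc/driver/nvidia/params}" | tr -d '"'
}

# power_mgmt_configured [FILE]
# Succeeds only if video memory preservation is enabled and a temporary
# file path has been set.
power_mgmt_configured() {
    [ "$(nv_param PreserveVideoMemoryAllocations "$@")" = "1" ] &&
        [ -n "$(nv_param TemporaryFilePath "$@")" ]
}

# Usage:
#   nv_param TemporaryFilePath
#   power_mgmt_configured && echo "power management options are active"
```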

Suspend: suspend journal.txt (724.1 KB)
The following details are of importance in this log:

  • Lines 175-1777 and 1808-1809: The nvidia-suspend.service is called upon and claims to be successful.
  • Lines 1813-1814: Between these two lines the system was suspended.
  • Lines 2042-2045: The nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT]) errors mentioned in posts above.
  • Line 2640: The reboot that was required to make the system responsive again.

Hibernate: hibernate journal.txt (666.8 KB)

  • Lines 103-105 and 138-139: Call and success of nvidia-hibernate.service.
  • Line 144: Actual reboot
  • Lines 1688-1689: again issues with the power state from D3cold (or hot) to D0.
  • Similar to before it is not a real hibernate as the system does end up with a fresh boot.
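
The line ranges listed above were picked out by hand. A filter along the following lines could locate the same kind of entries in a saved journal. It is only a sketch: the helper name nvidia_power_events is hypothetical, and the pattern covers just the service names and the nvidia-modeset errors quoted in this thread.

```shell
#!/bin/sh
# nvidia_power_events
# Reads journal text on stdin and prints, prefixed with their line numbers,
# the lines that mention the NVIDIA sleep services or nvidia-modeset errors.
nvidia_power_events() {
    grep -nE 'nvidia-(suspend|hibernate|resume)\.service|nvidia-modeset: ERROR'
}

# Usage:
#   journalctl -b -1 --no-pager | nvidia_power_events
#   nvidia_power_events < journal.txt
```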

Note: I’ve found other posts that discuss this solution for an issue that is apparently identical to ours. The advice there also centered on memory handling during sleep, but it did not solve their issue either. Interesting, however, are the remarks [1, 2] of an Nvidia moderator that work has been done on the suspend/resume behaviour in the 465.19.01 beta.

Nvidia bugreport: nvidia-bug-report.log.gz (467.0 KB)

I’ve just added more memory to my system. Sadly, it didn’t help, even though before suspending I had more than double my GPU memory available as free system memory.
I’m at a loss as to the cause of this issue.

This is probably obvious, but the issue disappears when Prime Intel (Power Saving Mode) is selected (normally I run the ‘Nvidia on-demand mode’). In the sys log below you can find several reboots:

  1. Prime Intel boot (line 2): here suspend proceeds perfectly fine. This apparently does call nvidia-suspend.service (line 3990) and nvidia-resume.service (line 4154), even though the nvidia drivers are not loaded. I then made the first bug report, after which I changed to Prime Nvidia (Performance mode), see line 9216.
  2. Prime Nvidia boot (line 10087): this shows the same behaviour as reported before. Suspend isn’t working properly. As expected, the issue is not the ‘on-demand’ setting itself, but simply the nvidia driver being loaded.

Bugreport Prime Intel: nvidia-bug-report-prime-Intel.log.gz (169.0 KB)
Bugreport Prime Nvidia: nvidia-bug-report-prime-nvidia.log.gz (400.5 KB)
Journal: journal.txt (2.1 MB)

I’ve now also reproduced with 470.42.01, below is a new bug report:

In terms of behaviour, there was a small change: Now my screen turns black immediately upon loading the module, no flickering of the backlight as observed with 460/465 versions.

There’s now also more in the kernel log (of course also contained in the bug report):

Jul 14 19:24:00 localhost kernel: [   33.819880] nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Jul 14 19:24:00 localhost kernel: [   33.820328] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
...
Jul 14 19:24:04 localhost kernel: [   38.126480] BUG: kernel NULL pointer dereference, address: 0000000000000070
Jul 14 19:24:04 localhost kernel: [   38.126483] #PF: supervisor read access in kernel mode
Jul 14 19:24:04 localhost kernel: [   38.126484] #PF: error_code(0x0000) - not-present page
Jul 14 19:24:04 localhost kernel: [   38.126485] PGD 0 P4D 0 
Jul 14 19:24:04 localhost kernel: [   38.126488] Oops: 0000 [#1] SMP PTI
Jul 14 19:24:04 localhost kernel: [   38.126490] CPU: 3 PID: 11479 Comm: X Tainted: P           O      5.9.11-gentoo #1
Jul 14 19:24:04 localhost kernel: [   38.126491] Hardware name: Alienware Alienware 17/04WT2G, BIOS A17 07/22/2019
Jul 14 19:24:04 localhost kernel: [   38.126503] RIP: 0010:_nv002520kms+0x18/0x70 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126504] Code: 24 1f 01 eb b2 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d df 69 0c 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 cf
Jul 14 19:24:04 localhost kernel: [   38.126506] RSP: 0018:ffffb032015fbd08 EFLAGS: 00010282
Jul 14 19:24:04 localhost kernel: [   38.126507] RAX: 0000000000000000 RBX: ffff8cf9ee73a008 RCX: 0000000000000082
Jul 14 19:24:04 localhost kernel: [   38.126508] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff8cf9ee73a008
Jul 14 19:24:04 localhost kernel: [   38.126509] RBP: 0000000000010009 R08: 0000000000000004 R09: 0000000000000000
Jul 14 19:24:04 localhost kernel: [   38.126510] R10: ffffb032015fbc78 R11: ffff8cf9b378b000 R12: ffff8cf9ee73a008
Jul 14 19:24:04 localhost kernel: [   38.126511] R13: ffff8cf9ee73a0a0 R14: 0000000000000fff R15: 0000000000010008
Jul 14 19:24:04 localhost kernel: [   38.126512] FS:  00007f36e09f58c0(0000) GS:ffff8cf9fecc0000(0000) knlGS:0000000000000000
Jul 14 19:24:04 localhost kernel: [   38.126513] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 14 19:24:04 localhost kernel: [   38.126514] CR2: 0000000000000070 CR3: 000000081063c001 CR4: 00000000001706e0
Jul 14 19:24:04 localhost kernel: [   38.126515] Call Trace:
Jul 14 19:24:04 localhost kernel: [   38.126524]  ? _nv002519kms+0xb1/0x150 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126532]  ? _nv002298kms+0x489/0x670 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126534]  ? __kmalloc+0x165/0x18c
Jul 14 19:24:04 localhost kernel: [   38.126536]  ? __check_heap_object+0x52/0xff
Jul 14 19:24:04 localhost kernel: [   38.126538]  ? __check_object_size+0x103/0x192
Jul 14 19:24:04 localhost kernel: [   38.126543]  ? nv_kthread_q_stop+0x2246/0x2c76 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126548]  ? nv_kthread_q_stop+0x227a/0x2c76 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126553]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126558]  ? nvkms_ioctl_common+0x41/0x10a [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126563]  ? nvkms_ioctl_common+0xdb/0x10a [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.126649]  ? nvidia_frontend_unlocked_ioctl+0x14/0x17 [nvidia]
Jul 14 19:24:04 localhost kernel: [   38.126652]  ? vfs_ioctl+0x19/0x26
Jul 14 19:24:04 localhost kernel: [   38.126653]  ? __do_sys_ioctl+0x63/0x86
Jul 14 19:24:04 localhost kernel: [   38.126656]  ? do_syscall_64+0x5d/0x6a
Jul 14 19:24:04 localhost kernel: [   38.126659]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 14 19:24:04 localhost kernel: [   38.126660] Modules linked in: ccm cmac algif_hash algif_skcipher af_alg bnep zram uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc btusb btrtl btbcm btintel bluetooth intel_rapl_msr iwlmvm mac80211 iwlwifi intel_rapl_common intel_powerclamp coretemp vhba(O) kvm_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm dell_wmi cfg80211 snd_hda_intel dell_smbios snd_intel_dspcfg pcspkr snd_hda_codec dell_wmi_descriptor alx snd_hda_core mdio dell_smo8800 dell_rbtn nvidia_drm(PO) nvidia_modeset(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nvidia(PO) sdhci_pci aesni_intel glue_helper iosf_mbi crypto_simd cqhci iTCO_wdt intel_pmc_bxt sdhci rtsx_pci_sdmmc mmc_core
Jul 14 19:24:04 localhost kernel: [   38.126673] CR2: 0000000000000070
Jul 14 19:24:04 localhost kernel: [   38.126674] ---[ end trace c7c301411c6c99f7 ]---
Jul 14 19:24:04 localhost kernel: [   38.159246] RIP: 0010:_nv002520kms+0x18/0x70 [nvidia_modeset]
Jul 14 19:24:04 localhost kernel: [   38.159248] Code: 24 1f 01 eb b2 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d df 69 0c 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 cf
Jul 14 19:24:04 localhost kernel: [   38.159249] RSP: 0018:ffffb032015fbd08 EFLAGS: 00010282
Jul 14 19:24:04 localhost kernel: [   38.159251] RAX: 0000000000000000 RBX: ffff8cf9ee73a008 RCX: 0000000000000082
Jul 14 19:24:04 localhost kernel: [   38.159251] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff8cf9ee73a008
Jul 14 19:24:04 localhost kernel: [   38.159252] RBP: 0000000000010009 R08: 0000000000000004 R09: 0000000000000000
Jul 14 19:24:04 localhost kernel: [   38.159253] R10: ffffb032015fbc78 R11: ffff8cf9b378b000 R12: ffff8cf9ee73a008
Jul 14 19:24:04 localhost kernel: [   38.159254] R13: ffff8cf9ee73a0a0 R14: 0000000000000fff R15: 0000000000010008
Jul 14 19:24:04 localhost kernel: [   38.159255] FS:  00007f36e09f58c0(0000) GS:ffff8cf9fecc0000(0000) knlGS:0000000000000000
Jul 14 19:24:04 localhost kernel: [   38.159256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 14 19:24:04 localhost kernel: [   38.159256] CR2: 0000000000000070 CR3: 000000081063c001 CR4: 00000000001706e0

Any ideas welcome. There’s currently no maintained functional driver for these cards, unless I go back to the legacy drivers.

nvidia-bug-report.log.gz (1.2 MB)

I can reproduce this in the new driver as well, in my case 470.57.02.

  journalctl -b -1:
# Some kind of traceback related to the Nvidia device
jul 25 19:44:41 bram-Zbook kernel: WARNING: CPU: 2 PID: 15529 at /var/lib/dkms/nvidia/470.57.02/build/nvidia/nv.c:4175 nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
jul 25 19:44:41 bram-Zbook kernel: Modules linked in: thunderbolt rfcomm ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) snd_hda_codec_conexant snd_hda_codec_generic ledtrig_audio uvcvideo btusb btrtl videobuf2_vmalloc btbcm videobuf2_memops videobuf2_v4l2 btintel videobuf2_common bluetooth videodev mc ecdh_generic ecc mei_hdcp intel_rapl_msr x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm iwlmvm snd_seq_midi snd_seq_midi_event snd_rawmidi mac80211 nvidia(PO) input_leds snd_seq hp_wmi snd_seq_device
jul 25 19:44:41 bram-Zbook kernel:  serio_raw i915 efi_pstore intel_wmi_thunderbolt sparse_keymap libarc4 wmi_bmof snd_timer iwlwifi mxm_wmi drm_kms_helper ee1004 cfg80211 snd cec soundcore rc_core processor_thermal_device joydev i2c_algo_bit intel_rapl_common fb_sys_fops syscopyarea intel_soc_dts_iosf mei_me sysfillrect sysimgblt mei intel_pch_thermal mac_hid int3403_thermal hp_accel int340x_thermal_zone acpi_pad int3400_thermal tpm_infineon lis3lv02d hp_wireless acpi_thermal_rel sch_fq_codel coretemp parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_alps hid_generic rtsx_pci_sdmmc nvme ahci crc32_pclmul psmouse i2c_i801 intel_lpss_pci e1000e libahci rtsx_pci i2c_smbus i2c_hid nvme_core intel_lpss idma64 xhci_pci virt_dma xhci_pci_renesas hid video pinctrl_sunrisepoint wmi pinctrl_intel
jul 25 19:44:41 bram-Zbook kernel: CPU: 2 PID: 15529 Comm: nvidia-sleep.sh Tainted: P        W  O      5.8.0-63-generic #71~20.04.1-Ubuntu
jul 25 19:44:41 bram-Zbook kernel: Hardware name: HP HP ZBook Studio G3/80D4, BIOS N82 Ver. 01.52 10/28/2020
jul 25 19:44:41 bram-Zbook kernel: RIP: 0010:nv_set_system_power_state+0x2c1/0x3c0 [nvidia]
jul 25 19:44:41 bram-Zbook kernel: Code: 00 4d 85 e4 0f 84 4a ff ff ff 41 83 fd 02 74 e9 49 8b bc 24 88 02 00 00 be 02 00 00 00 e8 57 d0 ff ff 85 c0 74 d3 0f 0b eb cf <0f> 0b e9 64 ff ff ff 48 c7 c7 50 ea a1 c2 e8 0c 10 a9 dd e8 47 1b
jul 25 19:44:41 bram-Zbook kernel: RSP: 0018:ffffacebc4b5fe20 EFLAGS: 00010206
jul 25 19:44:41 bram-Zbook kernel: RAX: 0000000000000003 RBX: 0000000000000002 RCX: 0000000080020001
jul 25 19:44:41 bram-Zbook kernel: RDX: 0000000080020002 RSI: 0000000000000001 RDI: ffff9e41e4992bc0
jul 25 19:44:41 bram-Zbook kernel: RBP: ffffacebc4b5fe50 R08: 0000000000000000 R09: ffffffffc0a6aa01
jul 25 19:44:41 bram-Zbook kernel: R10: ffff9e4164c9b000 R11: 0000000000000001 R12: ffff9e41e8ab4000
jul 25 19:44:41 bram-Zbook kernel: R13: 0000000000000000 R14: ffffacebc4b5fef0 R15: 00005594fb4b6540
jul 25 19:44:41 bram-Zbook kernel: FS:  00007f08062f3740(0000) GS:ffff9e41ef680000(0000) knlGS:0000000000000000
jul 25 19:44:41 bram-Zbook kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jul 25 19:44:41 bram-Zbook kernel: CR2: 00007fe04509f290 CR3: 00000004a53c0006 CR4: 00000000003606e0
jul 25 19:44:41 bram-Zbook kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
jul 25 19:44:41 bram-Zbook kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
jul 25 19:44:41 bram-Zbook kernel: Call Trace:
jul 25 19:44:41 bram-Zbook kernel:  nv_procfs_write_suspend+0xe7/0x140 [nvidia]
jul 25 19:44:41 bram-Zbook kernel:  proc_reg_write+0x66/0x90
jul 25 19:44:41 bram-Zbook kernel:  vfs_write+0xc9/0x200
jul 25 19:44:41 bram-Zbook kernel:  ksys_write+0x67/0xe0
jul 25 19:44:41 bram-Zbook kernel:  __x64_sys_write+0x1a/0x20
jul 25 19:44:41 bram-Zbook kernel:  do_syscall_64+0x49/0xc0
jul 25 19:44:41 bram-Zbook kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
jul 25 19:44:41 bram-Zbook kernel: RIP: 0033:0x7f08064071e7
jul 25 19:44:41 bram-Zbook kernel: Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
jul 25 19:44:41 bram-Zbook kernel: RSP: 002b:00007ffe20e8aa78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
jul 25 19:44:41 bram-Zbook kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f08064071e7
jul 25 19:44:41 bram-Zbook kernel: RDX: 0000000000000007 RSI: 00005594fb4b6540 RDI: 0000000000000001
jul 25 19:44:41 bram-Zbook kernel: RBP: 00005594fb4b6540 R08: 000000000000000a R09: 0000000000000006
jul 25 19:44:41 bram-Zbook kernel: R10: 00005594fb2c0017 R11: 0000000000000246 R12: 0000000000000007
jul 25 19:44:41 bram-Zbook kernel: R13: 00007f08064e26a0 R14: 00007f08064e34a0 R15: 00007f08064e28a0
jul 25 19:44:41 bram-Zbook kernel: ---[ end trace a0e25c3914b46a5a ]---

# End of traceback, the log continues regarding a PCIe device that I don't expect to be the GPU, as it reports a 2.5 GT/s PCIe x4 speed. (NVIDIA X Server Settings reports a x16 link) 
# Then it continues about Nvidia devices.

jul 25 19:44:45 bram-Zbook kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
jul 25 19:44:45 bram-Zbook kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Supervising 7 threads of 3 processes of 2 users.
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Successfully made thread 16429 of process 2667 owned by '1000' RT at priority 5.
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Supervising 8 threads of 3 processes of 2 users.
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Supervising 8 threads of 3 processes of 2 users.
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Successfully made thread 16430 of process 2667 owned by '1000' RT at priority 5.
jul 25 19:44:45 bram-Zbook rtkit-daemon[1269]: Supervising 9 threads of 3 processes of 2 users.
jul 25 19:44:45 bram-Zbook kernel: usb 3-1.4.2: new full-speed USB device number 11 using xhci_hcd
jul 25 19:44:45 bram-Zbook kernel: usb 3-1.3.3: New USB device found, idVendor=0c45, idProduct=6341, bcdDevice= 0.00
jul 25 19:44:45 bram-Zbook kernel: usb 3-1.3.3: New USB device strings: Mfr=2, Product=1, SerialNumber=0
jul 25 19:44:45 bram-Zbook kernel: usb 3-1.3.3: Product: USB 2.0 Camera
jul 25 19:44:45 bram-Zbook kernel: usb 3-1.3.3: Manufacturer: Sonix Technology Co., Ltd.
jul 25 19:44:47 bram-Zbook NetworkManager[1042]: <info>  [1627235087.3616] device (eth0): interface index 7 renamed iface from 'eth0' to 'enp63s0'
jul 25 19:44:47 bram-Zbook systemd-udevd[16025]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
jul 25 19:44:49 bram-Zbook kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
jul 25 19:44:49 bram-Zbook kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
jul 25 19:44:49 bram-Zbook /usr/lib/gdm3/gdm-x-session[1261]: (II) config/udev: Adding input device HP HP Dock Audio (/dev/input/event14)
jul 25 19:44:49 bram-Zbook /usr/lib/gdm3/gdm-x-session[1261]: (**) HP HP Dock Audio: Applying InputClass "libinput keyboard catchall"
jul 25 19:44:49 bram-Zbook /usr/lib/gdm3/gdm-x-session[1261]: (II) Using input driver 'libinput' for 'HP HP Dock Audio'
jul 25 19:44:49 bram-Zbook systemd[1]: systemd-suspend.service: Succeeded.
jul 25 19:44:49 bram-Zbook systemd[1]: Finished Suspend.
jul 25 19:44:49 bram-Zbook systemd[1]: Stopped target Sleep.
jul 25 19:44:49 bram-Zbook systemd[1]: Reached target Suspend.
jul 25 19:44:49 bram-Zbook systemd[1]: Starting NVIDIA system resume actions...
jul 25 19:44:49 bram-Zbook systemd[1]: Stopped target Suspend.
jul 25 19:44:49 bram-Zbook suspend[16434]: nvidia-resume.service
jul 25 19:44:49 bram-Zbook logger[16434]: <13>Jul 25 19:44:49 suspend: nvidia-resume.service
jul 25 19:44:49 bram-Zbook systemd[1]: nvidia-resume.service: Succeeded.
jul 25 19:44:49 bram-Zbook systemd[1]: Finished NVIDIA system resume actions.

The uncropped file is attached (journalctl -b -1.log.gz), as is the Nvidia bug report (nvidia-bug-report.log.gz).

So I distro-hopped from Elementary OS (Ubuntu variant) to Fedora and the issues disappeared for me. Decided to retry Elementary OS and the issue came back.

To finally fix this on my machine I uninstalled the nvidia driver deb packages and reinstalled the driver using the NVIDIA-*.run install file instead, and it worked. Now I’m running Elementary OS with the nvidia driver without sleep/resume crashing.

Feels like whoever is packaging the nvidia drivers for Ubuntu is doing something that doesn’t play nice with my laptop.

This is an interesting route to pursue; I too run drivers installed via a PPA for Ubuntu variants. I’m not exactly clear on where to submit the bug report. Originally I expected that I had to report it to the organization that maintains ppa:graphics-drivers/ppa, but only 3 bugs have been filed there in the past. Besides, I also get the issue with the drivers packaged by the Ubuntu team itself, that is, when using nvidia-graphics-drivers-470 maintained by the Ubuntu Developers (see also the 460 and 465). These have far more bug reports, and the team seems to respond.

@chris.bainbridge made a bug report about it already, but I don’t see any mention of it on the 465 and 470 pages. Should I make one?