GPU randomly fails to wake up from D3Cold

While the GPU is in D3Cold (Video Memory: Off), it intermittently fails to wake up.
When this happens, the application that tries to wake the GPU freezes until the driver timeouts expire, and the following dmesg messages appear in the log:

[  417.180294] NVRM: gpuWaitForGfwBootComplete_TU102: failed to wait for GFW_BOOT: (progress 0x9)
[  417.180299] NVRM: kgspWaitForGfwBootOk_TU102: failed to wait for GFW boot complete: 0x55 VBIOS version 94.04.43.00.9F
[  417.180300] NVRM: kgspWaitForGfwBootOk_TU102: (the GPU may be in a bad state and may need to be reset)
[  423.189349] NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
[rest in the nvidia-bug-report attachment]
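
To check quickly whether a given boot has hit this failure, the kernel log can be filtered for the messages quoted above. This is only a rough sketch: the function name `count_gfw_failures` is made up for illustration, and the patterns are taken from the log excerpt in this post, so other driver versions may word the messages differently.

```shell
#!/bin/sh
# Hypothetical helper: counts log lines matching the GFW_BOOT wait failure
# or the GSP Timeout banner from the excerpt above.
# Reads log text from stdin, so it works with dmesg, journalctl -k, or a saved file.
count_gfw_failures() {
    # grep -c prints the number of matching lines; "|| true" keeps the exit
    # status at 0 when nothing matches (grep itself returns 1 in that case).
    grep -c -E 'failed to wait for GFW_BOOT|GSP Timeout' || true
}
```

Example use: `sudo dmesg | count_gfw_failures` prints 0 on a boot where the wake-up has not failed yet.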

Reproduction steps

  1. Install the NVIDIA driver (open-dkms variant)
  2. Configure udev rules based on Chapter 22 of the driver README
  3. Configure the following driver options:
options nvidia_drm modeset=1 fbdev=1
options nvidia "NVreg_DynamicPowerManagement=0x02" "NVreg_DynamicPowerManagementVideoMemoryThreshold=256"
  4. Run sudo mkinitcpio -P and sudo update-grub, then fully reboot the system
  5. Manually remove the audio PCI device of the card using:
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
  6. Wait for the card to go into suspended/D3Cold mode
  7. Run prime-run glxgears or sudo nvidia-smi to wake the card up

Repeat the last two steps (wait for the card to suspend, then wake it) until the bug triggers. The GPU usually wakes up successfully several times before this happens.
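
The wait-then-wake cycle can be scripted roughly as follows. This is a sketch under a few assumptions: the GPU is assumed to sit at `0000:01:00.0` (function 0 of the device whose audio function is removed above), the helper names are invented for illustration, and the kernel's `runtime_status` attribute only reports `suspended`, which does not by itself prove the device reached D3Cold rather than D3hot.

```shell
#!/bin/sh
# Assumed PCI address of the GPU and sysfs location; both can be overridden.
GPU_ADDR="${GPU_ADDR:-0000:01:00.0}"
SYSFS_ROOT="${SYSFS_ROOT:-/sys/bus/pci/devices}"

# Print the kernel runtime PM state of the GPU ("suspended", "active", ...).
# Reading this attribute should not wake the device.
gpu_runtime_status() {
    cat "$SYSFS_ROOT/$GPU_ADDR/power/runtime_status"
}

# Poll once per second until the GPU reports "suspended".
# Returns 1 if that has not happened after $1 seconds (default 120).
wait_for_suspend() {
    timeout_s="${1:-120}"
    elapsed=0
    while [ "$(gpu_runtime_status)" != "suspended" ]; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# One iteration: wait for suspend, then wake the card with nvidia-smi.
# timeout(1) kills nvidia-smi after 60 s and returns 124, which is the
# case of interest here (a wake-up that hangs).
repro_once() {
    wait_for_suspend 120 || return 2
    timeout 60 nvidia-smi > /dev/null
}
```

Calling `repro_once` in a loop and stopping at the first non-zero exit status gives a rough count of how many successful wake-ups precede a failure. The original steps used `sudo nvidia-smi`; add `sudo` if your setup needs it.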

Hardware

Notebook: Acer Nitro AN515-45
Graphics card: GeForce RTX 3080 Mobile 8GB
No external monitor attached
Happens when running both on AC power and battery

OS info

Manjaro (unstable branch, fully updated as of 18/10/2024)
Kernel: 6.11.4-1-MANJARO
Driver: nvidia-open-dkms 560.35.03
Desktop environment: KDE Plasma 6.2.1.1 running under Wayland

nvidia-bug-report attached
nvidia-bug-report.log.gz (558.4 KB)

One suggestion I can make off the top of my head is to disable nvidia-persistenced. In the past it would cause the GPU to fail, and a reboot would only bring it back until it failed again.
I can also suggest trying a CachyOS/EndeavourOS live CD to see if D3Cold works OK with nvidia-smi. If that's the case, then it's most likely Manjaro or something in your configuration, unless D3Cold was working previously and then just stopped.
Btw, you shouldn't need to create the nvidia.conf file for this; D3Cold should (ideally) just work out of the box for your GPU…
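
If you want to see which power management values the loaded driver actually uses, with or without the nvidia.conf file, something like this should do. It's a rough sketch: `show_dpm_params` is just a name I made up, and it assumes `/proc/driver/nvidia/params` lists settings as `Name: value` without the `NVreg_` prefix.

```shell
#!/bin/sh
# Hypothetical helper: print the DynamicPowerManagement* settings reported by
# the loaded nvidia module. The file path can be passed as the first argument;
# it defaults to the driver's procfs params file.
show_dpm_params() {
    params_file="${1:-/proc/driver/nvidia/params}"
    # Assumed layout: one "Name: value" pair per line, no NVreg_ prefix.
    grep -E '^DynamicPowerManagement' "$params_file"
}
```

Running `show_dpm_params` with no argument after a reboot shows whether the options from the modprobe file were picked up.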

Hi! I disabled the nvidia-persistenced.service, but after a reboot the bug happened on the first try, so this definitely didn’t help.
Trying a different distribution is currently not an option for me (even as a liveCD/flash to boot from), but might be in the coming weeks. I will post an update if I get to try that.