BUG: `can't change power state from D3cold to D0 (config space inaccessible)`, stuck at boot

Hello everyone,

Is this the right area for reporting this issue?
Back in September, I was still using the 450.57 driver with the 5.8 Linux kernel and Ubuntu 20.04.

On the 5.10 Linux kernel with the 460.39 driver, the workaround I mentioned no longer works.

Additionally, the “NVreg_DynamicPowerManagement=0x02” option is now the only way to suspend the NVIDIA GPU when it is not in use. The “NVreg_DynamicPowerManagement=0x01” setting suspends it only if no applications are running on it. But on 460.39, Xorg creates something like a persistent glxserver for NVIDIA. That counts as an application, so the driver never suspends the GPU even when it is not being used.
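For anyone trying to reproduce this, the option is normally passed to the nvidia kernel module through a modprobe configuration file. A minimal sketch, assuming a file name of my own choosing (any `.conf` file under /etc/modprobe.d/ should work) and that the module is reloaded or the system rebooted afterwards:

```
# /etc/modprobe.d/nvidia-pm.conf  (hypothetical file name, not from my setup)
# 0x02 requests fine-grained runtime power management; 0x01 is the coarse-grained mode.
options nvidia NVreg_DynamicPowerManagement=0x02
```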

For the driver development team: I think I have stumbled upon easier reproduction steps that trigger this bug while the system is still running (making data collection possible).

  1. Reboot with nvidia, nvidia_drm and nvidia_modeset blacklisted. Make sure that these modules are not loaded but still can be loaded manually.
  2. Make sure that /sys/bus/pci/devices/0000:01:00.0/power/control is set to on, not auto
  3. Make sure that /sys/bus/pci/devices/0000:01:00.0/power/runtime_status reads active
  4. Run nvidia-smi or nvidia-bug-report.sh which should eventually load the nvidia kernel module.
  5. You get a Killed message for nvidia-smi, or no output at all for nvidia-bug-report.sh. Either way, the bug should have been triggered: the GPU will not be usable, and the system is on the brink of crashing.
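To make steps 1 to 3 easier to verify before running step 4, here is a rough Python sketch that reads the same sysfs files and /proc/modules. It assumes the GPU is at PCI address 0000:01:00.0 as in my case; the function names `read_attr` and `check_preconditions` are just my own illustrative helpers, not part of any NVIDIA tool.

```python
from pathlib import Path

# Assumption: the GPU sits at PCI address 0000:01:00.0, as in the steps above.
DEVICE_DIR = Path("/sys/bus/pci/devices/0000:01:00.0")
NVIDIA_MODULES = ("nvidia", "nvidia_drm", "nvidia_modeset")


def read_attr(device_dir, name):
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        return (Path(device_dir) / name).read_text().strip()
    except OSError:
        return None


def check_preconditions(device_dir=DEVICE_DIR, modules_file="/proc/modules"):
    """Return a list of problems; an empty list means steps 1 to 3 hold."""
    problems = []

    # Step 2: runtime power management must be disabled ("on", not "auto").
    control = read_attr(device_dir, "power/control")
    if control != "on":
        problems.append(f"power/control is {control!r}, expected 'on'")

    # Step 3: the device must currently be awake.
    status = read_attr(device_dir, "power/runtime_status")
    if status != "active":
        problems.append(f"power/runtime_status is {status!r}, expected 'active'")

    # Step 1: none of the NVIDIA modules may be loaded yet.
    # The first column of each /proc/modules line is the module name.
    try:
        lines = Path(modules_file).read_text().splitlines()
    except OSError:
        lines = []
    loaded = {line.split()[0] for line in lines if line.strip()}
    for module in NVIDIA_MODULES:
        if module in loaded:
            problems.append(f"module {module} is already loaded")

    return problems


if __name__ == "__main__":
    issues = check_preconditions()
    for issue in issues:
        print("NOT READY:", issue)
    if not issues:
        print("Steps 1 to 3 hold; step 4 can be run now.")
```

This only checks the state; it does not check that the modules can still be loaded manually, and it does not run nvidia-smi itself, since that is the part that triggers the bug.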

I attached the output and the dmesg logs for two cases:

  1. Just running nvidia-bug-report.sh (I also captured the dmesg log after it)
    nvidia-bug-report.tar.gz (694.3 KB)

  2. Running nvidia-smi then nvidia-bug-report.sh. The nvidia-bug-report.sh hangs.
    nvidia-smi.tar.gz (742.9 KB)

If more information is needed, please reply and I will try my best to provide it.