BUG: `can't change power state from D3cold to D0 (config space inaccessible)`, stuck at boot

Originally asked here - https://forum.manjaro.org/t/optimus-random-freezes-at-boot-cannot-change-nvidia-gfx-state-from-d3-to-d0/127089 where I was told about this being a possible nvidia driver bug.

Basically, I recently installed manjaro latest including latest nvidia hybrid driver, and prime (I have nvidia optimus).

Mostly, the setup works fine. But every 5-6 boots, the boot process hangs at random after "TLP system startup/shutdown". There are no errors or warnings in dmesg or journalctl, just a freeze. Rebooting by pressing the power button usually makes it boot on the next try.

It is a fresh linux install, unmodified, so I have not configured anything on it.

I tried 3 different linux kernels. 5.4 lts, 5.5 & 5.6 rc3. I can reproduce the issue on all 3 kernels. But 5.6rc3 is the only one which displays an error, which says -

nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Additional information:

inxi -Fxxxz
System:
  Host: manjaro Kernel: 5.5.6-1-MANJARO x86_64 bits: 64 compiler: gcc 
  v: 9.2.1 Desktop: KDE Plasma 5.17.5 tk: Qt 5.14.1 wm: kwin_x11 dm: SDDM 
  Distro: Manjaro Linux 
Machine:
  Type: Laptop System: Micro-Star product: PE62 7RD v: REV:1.0 
  serial: <filter> Chassis: type: 10 serial: <filter> 
  Mobo: Micro-Star model: MS-16J9 v: REV:1.0 serial: <filter> 
  UEFI: American Megatrends v: E16J9IMS.324 date: 03/23/2018 
Battery:
  ID-1: BAT1 charge: 38.1 Wh condition: 38.1/42.4 Wh (90%) 
  volts: 12.2/10.8 model: MSI BIF0_9 type: Li-ion serial: N/A 
  status: Full 
CPU:
  Topology: Quad Core model: Intel Core i7-7700HQ bits: 64 type: MT MCP 
  arch: Kaby Lake rev: 9 L2 cache: 6144 KiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 44817 
  Speed: 1200 MHz min/max: 800/3800 MHz Core speeds (MHz): 1: 1200 
  2: 1200 3: 1200 4: 1200 5: 1201 6: 1200 7: 1200 8: 1200 
Graphics:
  Device-1: Intel HD Graphics 630 vendor: Micro-Star MSI driver: i915 
  v: kernel bus ID: 00:02.0 chip ID: 8086:591b 
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Mobile] 
  vendor: Micro-Star MSI driver: nvidia v: 440.59 bus ID: 01:00.0 
  chip ID: 10de:1c8d 
  Display: x11 server: X.Org 1.20.7 driver: modesetting,nvidia 
  alternate: fbdev,intel,nouveau,nv,vesa compositor: kwin_x11 
  resolution: 1920x1080~60Hz 
  OpenGL: renderer: Mesa DRI Intel HD Graphics 630 (Kaby Lake GT2) 
  v: 4.6 Mesa 19.3.4 compat-v: 3.0 direct render: Yes 
Audio:
  Device-1: Intel CM238 HD Audio vendor: Micro-Star MSI 
  driver: snd_hda_intel v: kernel bus ID: 00:1f.3 chip ID: 8086:a171 
  Sound Server: ALSA v: k5.5.6-1-MANJARO 
Network:
  Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak] 
  driver: iwlwifi v: kernel port: e000 bus ID: 02:00.0 chip ID: 8086:24fb 
  IF: wlp2s0 state: up mac: <filter> 
  Device-2: Qualcomm Atheros QCA8171 Gigabit Ethernet 
  vendor: Micro-Star MSI driver: alx v: kernel port: d000 bus ID: 03:00.0 
  chip ID: 1969:10a1 
  IF: enp3s0 state: down mac: <filter> 
Drives:
  Local Storage: total: 931.51 GiB used: 23.08 GiB (2.5%) 
  ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630 
  size: 931.51 GiB speed: 6.0 Gb/s rotation: 7200 rpm serial: <filter> 
  rev: A3U0 scheme: GPT 
Partition:
  ID-1: / size: 244.47 GiB used: 23.05 GiB (9.4%) fs: ext4 dev: /dev/sda6 
Sensors:
  System Temperatures: cpu: 46.0 C mobo: 27.8 C 
  Fan Speeds (RPM): N/A 
Info:
  Processes: 223 Uptime: 3m Memory: 15.56 GiB used: 968.4 MiB (6.1%) 
  Init: systemd v: 242 Compilers: gcc: 9.2.1 Shell: zsh v: 5.8 
  running in: konsole inxi: 3.0.37

nvidia-bug-report.log.gz (206 KB)

Can someone from nvidia confirm whether this is a bug with the nvidia drivers or my hardware or linux or something else. @aplattner?

Also, I found the patch "[v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges" on Patchwork, which may be related. But I am not sure whether it affects the proprietary driver + prime. Also, according to the last comment on that patch, it looks like they did not find a proper way to fix it anyway :(

I’m sorry, I don’t know offhand whether this is a system bug or a driver bug. Based on the symptoms it sounds like a system-level problem but it’s hard to know for sure.

For now, I would recommend disabling dynamic power management by removing whatever rule is setting NVreg_DynamicPowerManagement=2 (presumably in /etc/modprobe.d/* somewhere).
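To locate that rule, something like the following may help. This is only a sketch: find_dynamic_pm_rules is a hypothetical helper name, and the default directory is an assumption (some distributions may also ship modprobe files elsewhere, for example under /usr/lib/modprobe.d).

```shell
#!/bin/sh
# List every line that mentions NVreg_DynamicPowerManagement in a modprobe
# configuration directory (default: /etc/modprobe.d).
# find_dynamic_pm_rules is a hypothetical helper name, not an NVIDIA tool.
find_dynamic_pm_rules() {
    dir="${1:-/etc/modprobe.d}"
    # -r: recurse, -n: print line numbers, -s: stay quiet about unreadable files
    # Exit status is non-zero when nothing matches.
    grep -rns 'NVreg_DynamicPowerManagement' "$dir"
}
```

Running find_dynamic_pm_rules with no argument searches /etc/modprobe.d; pass another directory as the first argument to search there instead. Each hit is printed as file:line:content, so the file to edit is the first field.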

For now I am using optimus-switch-sddm (GitHub: dglt1/optimus-switch-sddm), an installer that sets up nvidia PRIME and allows easy switching between intel/nvidia (prime mode) and intel-only mode, where the nvidia gpu is powered down and no longer visible. Modes are switched with "sudo set-intel.sh" or "sudo set-nvidia.sh". Hopefully, this issue gets solved soon.

I tracked down someone who is familiar with this problem and unfortunately it does sound like a bug in the system firmware rather than something we can fix in the driver. If I understand correctly, the system is powering the GPU off before the driver has a chance to load and initialize it properly.

Please check to see if there is a BIOS update available for your system.

No updates available. AFAICT, msi stopped updates for this laptop.
https://www.msi.com/Laptop/support/PE62-7RD.html
BIOS last update 2018.
Firmware last update 2017.
And I have no idea if there is a way to contact msi. Even if there was a way to contact them, I lack the technical knowledge required to explain this problem to them.

So I am guessing, using linux with nvidia on this laptop is a no-go? @aplattner

hello?

I’m experiencing the same problem; however, it appears that my nvidia card isn’t showing up in lspci:

 sudo lspci |grep -E "VGA|3D"
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)

Though I’m seeing similar dmesg info at boot:

[   34.191562] xhci_hcd 0000:3e:00.0: xHCI Host Controller
[   34.191644] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 4
[   34.191652] xhci_hcd 0000:3e:00.0: Host supports USB 3.1 Enhanced SuperSpeed
[   34.191693] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.05
[   34.191694] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[   34.191696] usb usb4: Product: xHCI Host Controller
[   34.191697] usb usb4: Manufacturer: Linux 5.5.10-200.fc31.x86_64 xhci-hcd
[   34.191698] usb usb4: SerialNumber: 0000:3e:00.0
[   34.191851] hub 4-0:1.0: USB hub found
[   34.191860] hub 4-0:1.0: 2 ports detected
[   34.363952] thunderbolt 0-0: ignoring unnecessary extra entries in DROM
[   75.347447] pcieport 0000:07:02.0: can't change power state from D3cold to D0 (config space inaccessible)
[   75.347465] xhci_hcd 0000:3e:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   75.347473] xhci_hcd 0000:3e:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   75.347487] xhci_hcd 0000:3e:00.0: Controller not ready at resume -19
[   75.347488] xhci_hcd 0000:3e:00.0: PCI post-resume error -19!
[   75.347489] xhci_hcd 0000:3e:00.0: HC died; cleaning up
[   75.347502] xhci_hcd 0000:3e:00.0: remove, state 4
[   75.347505] usb usb4: USB disconnect, device number 1
[   75.347736] xhci_hcd 0000:3e:00.0: USB bus 4 deregistered
[   75.347806] xhci_hcd 0000:3e:00.0: remove, state 4
[   75.347809] usb usb3: USB disconnect, device number 1
[   75.348004] xhci_hcd 0000:3e:00.0: Host halt failed, -19
[   75.348006] xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
[   75.348063] xhci_hcd 0000:3e:00.0: USB bus 3 deregistered

wagoodman.mlbx, this looks like some incorrect udev rules placed by your distro. Please run
grep 10de /lib/udev/rules.d/*
and post the output.

Looks like it can be fixed in the nvidia drivers: 156341 – Nvidia fails to power on again, resulting in AML_INFINITE_LOOP/lockups (multiple laptops affected)

I tried nvidia 450.57 to test out the newer PM features on GeForce GTX 1650, and every reboot I got this error preventing me from actually using the laptop.

I am still testing this out, but there’s one workaround that allowed me to use the laptop: add “modprobe.blacklist=nvidia” to the kernel command line before it boots.

The theory of the workaround is that the nvidia driver must be loaded later, and that “modprobe.blacklist=nvidia” on the kernel command line does not actually disable or prevent loading the nvidia driver. It just prevents the driver from being loaded automatically; it still gets loaded by the time X starts.
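To confirm that the parameter really reached the running kernel, a check along these lines could be used. It is a rough sketch: cmdline_blacklists is a hypothetical helper name, and the optional second argument (a file to read instead of /proc/cmdline) exists only so the check can be tried on a saved copy.

```shell
#!/bin/sh
# Succeed if the given module is listed in a modprobe.blacklist= parameter.
# cmdline_blacklists is a hypothetical helper name.
#   $1: module name, e.g. nvidia
#   $2: file holding the kernel command line (default: /proc/cmdline)
cmdline_blacklists() {
    module="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each parameter on its own line, then look for the module as a
    # whole entry in the comma-separated modprobe.blacklist= value.
    tr -s ' \t' '\n\n' < "$cmdline_file" |
        grep -Eq "^modprobe\.blacklist=(.*,)?${module}(,.*)?\$"
}
```

For example, cmdline_blacklists nvidia && echo "nvidia is blacklisted on the command line". Making the parameter permanent depends on the boot loader; on GRUB-based installs it is usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, but that is an assumption about the setup, not something stated in this thread.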

Something tells me that a race is going on somewhere, maybe because nvidia assumes that the GPU is completely powered off (D3cold) when it’s not actually off, but after some time Linux’s runtime PM powers off the GPU, which coincidentally happens before X starts. Then the nvidia driver loads, proceeds to power on (D0) the GPU and succeeds.

It’s not foolproof though: there is still a chance that this error pops up and freezes the system on wake.

I agree that there is a race somewhere because I have the problem maybe 80% of the time: I need to boot my laptop 5 times so that it finally initializes everything. When it gets stuck, I have the following in dmesg:

[   15.796957] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   15.796998] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   15.797882] iwlwifi 0000:3b:00.0: HW_REV=0xFFFFFFFF, PCI issues?
[   15.888498] iwlwifi: probe of 0000:3b:00.0 failed with error -5

And these are the only error messages I have in dmesg. Even then, my laptop boots up to the graphical interface and everything, but none of the network cards can be found if the initialization failed.

When the initialization succeeds, dmesg contains (amongst others, of course):

[   12.288394] iwlwifi 0000:3b:00.0: enabling device (0000 -> 0002)
[   12.322697] iwlwifi 0000:3b:00.0: firmware: direct-loading firmware iwlwifi-9260-th-b0-jf-b0-46.ucode

Excerpt of lspci (after a successful boot):

01:00.0 3D controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
3b:00.0 Network controller: Intel Corporation Wireless-AC 9260 (rev 29)

I’m using a Debian testing system.

I am somewhat doubtful that what seems to be a wifi initialization issue is actually the same problem, but I do have an NVIDIA card, and google does not show many pages with the exact same error message that I see…
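To compare reports like the ones in this thread, it can help to pull out which driver and PCI address each failure belongs to. A small sketch follows; d3_failures is a hypothetical helper name, and the pattern only assumes the message wording quoted in this thread.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "driver address from-state" once
# for every distinct "can't change power state from D3... to D0" message.
# d3_failures is a hypothetical helper name.
d3_failures() {
    # Keep only "<driver> <domain:bus:dev.fn>: can't change ... to D0"
    grep -Eo "[^ ]+ [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]: can't change power state from D3(cold|hot) to D0" |
        # Drop the colon after the address; the state is the third-last word
        awk '{ sub(/:$/, "", $2); print $1, $2, $(NF - 2) }' |
        sort -u
}
```

It can be fed from a live log, for example dmesg | d3_failures, or from a saved file. If the same bridge or root port shows up for several devices, that would point at something below the individual drivers, though this is only a way to organize the evidence, not a diagnosis.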

I’ve completed my tests (quite a while back) regarding this issue and I seem to observe a pretty simple pattern.

PCI PM state changes don’t seem to occur instantly after they are requested, meaning a state change from D0 → D3cold or vice versa may take a few seconds.

During that time, any state change requested will result in “can’t change power state from D3cold to D0”. This could be because the PCI device cannot stop midway through the transition.

In my system (a laptop), I have two ways to trigger this error:

The first way is to trigger a transition by writing ‘auto’ or ‘on’ to the PCI device’s PM control, whichever is not already set, and then try loading the module.
For example, on my system the result of
cat /sys/bus/pci/devices/0000:01:00.0/power/control
is on, so I run
echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
After that, I try to load the nvidia module, and the system freezes with that error.

The second way relies on the fact that a udev rule sets the PCI device’s PM control to “on” when the module is unloaded. You just have to unload and then reload the module.
The udev rule is from the Automated Setup section of PCI-Express Runtime D3 (RTD3) Power Management.
For example, after making sure that the nvidia module is loaded, I wait for about 2 minutes and then run modprobe -r nvidia; modprobe nvidia. This also freezes the system with that error.

So my conclusion? I think the foolproof way here is to never set the PCI device’s runtime PM control to ‘auto’ during boot, and to set it only after the system has initialized. Alternatively, load the module manually after a successful boot, or otherwise delay loading it for as long as possible.
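One way to act on this is to look at the device’s runtime PM state before loading the module, and to wait while a transition is reported. The sketch below makes two assumptions that I have not verified on affected hardware: that power/runtime_status in sysfs reflects the transition described above, and that its intermediate values are "suspending" and "resuming". pci_pm_state and wait_for_settled_pm are hypothetical helper names.

```shell
#!/bin/sh
# Print the runtime PM settings of a PCI device.
#   $1: sysfs directory, e.g. /sys/bus/pci/devices/0000:01:00.0
pci_pm_state() {
    dev="$1"
    printf 'control=%s runtime_status=%s\n' \
        "$(cat "$dev/power/control")" "$(cat "$dev/power/runtime_status")"
}

# Poll until runtime_status is neither "suspending" nor "resuming".
#   $1: sysfs directory of the device
#   $2: maximum number of polls (default 30)
#   $3: seconds to sleep between polls (default 1)
# Returns 0 once the state has settled, 1 if it never did.
wait_for_settled_pm() {
    dev="$1"
    max_tries="${2:-30}"
    delay="${3:-1}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        status=$(cat "$dev/power/runtime_status")
        case "$status" in
            suspending|resuming) sleep "$delay" ;;
            *) return 0 ;;
        esac
        tries=$((tries + 1))
    done
    return 1
}
```

Usage would look like wait_for_settled_pm /sys/bus/pci/devices/0000:01:00.0 && sudo modprobe nvidia. If the hardware transition is not visible in runtime_status, this will not help, and a plain delay before loading the module may be the only option.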

Installing optimus-manager and forcing the use of the nvidia card seems to have sorted the issue for me on Manjaro XFCE.

Hello everyone,

Is this the right area for reporting this issue?
Back in September, I was still using the 450.57 driver with the 5.8 Linux kernel and Ubuntu 20.04.

On 5.10 Linux kernel with the 460.39 version, the workaround I mentioned does not work anymore.

Additionally, the “NVreg_DynamicPowerManagement=0x02” option is now the only way to suspend the NVIDIA GPU when not in use. “NVreg_DynamicPowerManagement=0x01” will suspend it only if there are no applications running on it. But on 460.39, Xorg creates something like a persistent glxserver for NVIDIA. That counts as an application, so the driver will never suspend the GPU even if it is not being used.

For the driver development team: I think I have stumbled upon easier reproduction steps that trigger this bug while the system is still running (making data collection possible):

  1. Reboot with nvidia, nvidia_drm and nvidia_modeset blacklisted. Make sure that these modules are not loaded but still can be loaded manually.
  2. Make sure that /sys/bus/pci/devices/0000:01:00.0/power/control reads on, not auto
  3. Make sure that /sys/bus/pci/devices/0000:01:00.0/power/runtime_status reads active
  4. Run nvidia-smi or nvidia-bug-report.sh which should eventually load the nvidia kernel module.
  5. You get a Killed message for nvidia-smi, or nothing for nvidia-bug-report.sh. Nevertheless, the bug should have been triggered, the GPU will not be usable, and the system is on the brink of crashing.
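Steps 1 to 3 can be checked in one go before running nvidia-smi. This is a sketch only: check_repro_preconditions is a hypothetical helper name, and the optional second argument (a file in /proc/modules format) is there so the check can be tried against saved data.

```shell
#!/bin/sh
# Verify the preconditions from steps 1-3 and report anything that is off.
#   $1: sysfs directory of the GPU, e.g. /sys/bus/pci/devices/0000:01:00.0
#   $2: module list to inspect (default: /proc/modules)
# Returns 0 if all preconditions hold, 1 otherwise.
check_repro_preconditions() {
    dev="$1"
    modules_file="${2:-/proc/modules}"
    failed=0
    for mod in nvidia nvidia_drm nvidia_modeset; do
        # Each line of /proc/modules starts with the module name and a space
        if grep -q "^$mod " "$modules_file"; then
            echo "module $mod is loaded"
            failed=1
        fi
    done
    control=$(cat "$dev/power/control")
    status=$(cat "$dev/power/runtime_status")
    if [ "$control" != "on" ]; then
        echo "power/control is $control, expected on"
        failed=1
    fi
    if [ "$status" != "active" ]; then
        echo "runtime_status is $status, expected active"
        failed=1
    fi
    return "$failed"
}
```

For example: check_repro_preconditions /sys/bus/pci/devices/0000:01:00.0 && nvidia-smi. Adjust the address to wherever the GPU sits on the system being tested.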

I attached the output and the dmesg logs for two cases:

  1. Just running nvidia-bug-report.sh (I also captured the dmesg log after it)
    nvidia-bug-report.tar.gz (694.3 KB)

  2. Running nvidia-smi then nvidia-bug-report.sh. The nvidia-bug-report.sh hangs.
    nvidia-smi.tar.gz (742.9 KB)

If more information is needed, please reply and I will try my best to provide it.

I am having the same issue on my workstation with 2x V100 GPUs.
I can use Nvidia driver 450.119.03 with no issues but I get the following error during startup when I upgrade to Nvidia driver 460.73.01. Then, the startup stops at the following message and the graphical login is not shown.

`[10.404497] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)`

I got this error originally with Ubuntu 18.04 (with kernels 5.4.0-54.60 and 5.4.0-73.82). To try to solve it, I upgraded to Ubuntu 20.04 (kernel 5.8.0-53.60) and also upgraded my workstation BIOS, but it did not help. I have also tried with and without blacklisting the nouveau driver. Following @bugmenot.oss’s instructions, I ran cat /sys/bus/pci/devices/0000:01:00.0/power/control (my GPU devices seem to be at 0000:15:00 and 0000:2d:00) and observed that the setting was auto. However, I could not edit it because I had no write permission to /sys, despite mounting it as read-only.
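Since the GPUs here are not at 0000:01:00.0, the control files to look at are the ones under their own addresses. A small sketch for finding them: it walks the PCI devices in sysfs and keeps those whose vendor ID is 0x10de (the ID that appears for NVIDIA elsewhere in this thread). list_nvidia_pci_devices is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<address> control=<value>" for every PCI device whose vendor ID
# is 0x10de (NVIDIA).
#   $1: directory holding the PCI devices (default: /sys/bus/pci/devices)
list_nvidia_pci_devices() {
    pci_root="${1:-/sys/bus/pci/devices}"
    for dev in "$pci_root"/*; do
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            # ${dev##*/} strips the directory part, leaving the address
            printf '%s control=%s\n' "${dev##*/}" \
                "$(cat "$dev/power/control" 2>/dev/null)"
        fi
    done
}
```

On the write permission problem, one possible cause (a guess, since the exact command is not shown) is that with sudo echo on > FILE the redirection is performed by the unprivileged shell rather than by sudo. Writing through tee avoids that, for example echo on | sudo tee /sys/bus/pci/devices/ADDRESS/power/control, where ADDRESS is a placeholder for one of the addresses printed above.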

Had to return to driver 450 to get my Ubuntu graphics back. But I do need the 460 driver for development with CUDA 11.2.

Here are the nvidia-bug-report with Nvidia driver 450 and with 460.
nvidia-bug-report_with_nvidia-driver-450.log.gz (727.7 KB)
nvidia-bug-report_with_nvidia-driver-460.log.gz (562.7 KB)

Having the same issue with a GTX 1650 on Manjaro linux 21.2.0 running kernel 5.14 and NVIDIA 495.44 (hybrid graphics)

It seems to happen more frequently when rebooting from Windows, even with Fast Boot toggled off.
Here’s nvidia-bug-report on a correct boot (I cannot run it when the system crashes because it’s a full-blown system hang)

nvidia-bug-report.log.gz (263.2 KB)

Surprisingly, I didn’t have this issue on Ubuntu, although I have deleted that install and don’t remember which drivers were running there.
I have also noticed that disabling the NVIDIA card in the BIOS stops the bug from happening.

I also encountered this problem.
I installed the nvidia driver for my optimus laptop and then my system wouldn’t boot consistently. 1 out of 5 times it would boot normally, 1 out of 5 it would boot with the Nvidia daemon failing, and 3 out of 5 it would not boot and froze before login.
This is the “race” reported above; also, when it booted with the Nvidia daemon failing, dmesg showed the bug in the title.
All I had to do was blacklist the nvidia driver in /etc/modprobe.d/bumblebee.conf. This way the module is loaded after boot, when bumblebee is called.
This is explicitly mentioned in the configuration file which is created during bumblebee installation:
From /etc/modprobe.d/bumblebee.conf:

# do not automatically load nvidia as it's unloaded anyway when bumblebeed
# starts and may fail bumblebeed to disable the card in a race condition.

Hope this helps someone.


Running On:

Asus N551JW Laptop (Optimus)
Linux 5.10.0-13-amd64
Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M]
Driver Version: 460.91.03