BUG: `can't change power state from D3cold to D0 (config space inaccessible)`, stuck at boot

Originally asked at https://forum.manjaro.org/t/optimus-random-freezes-at-boot-cannot-change-nvidia-gfx-state-from-d3-to-d0/127089 , where I was told this might be an nvidia driver bug.

I recently installed the latest Manjaro, including the latest nvidia hybrid driver and prime (I have an nvidia optimus laptop).

Mostly, the setup works fine. But every 5-6 boots, the boot process hangs right after the "TLP system startup/shutdown" message. There are no errors or warnings in dmesg or journalctl, just a freeze. Forcing a reboot with the power button usually lets it boot on the next try.

It is a fresh linux install, unmodified, so I have not configured anything on it.

I tried three kernels: 5.4 LTS, 5.5 and 5.6-rc3. I can reproduce the issue on all three, but 5.6-rc3 is the only one that prints an error:

nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Additional information:

inxi -Fxxxz
System:
  Host: manjaro Kernel: 5.5.6-1-MANJARO x86_64 bits: 64 compiler: gcc 
  v: 9.2.1 Desktop: KDE Plasma 5.17.5 tk: Qt 5.14.1 wm: kwin_x11 dm: SDDM 
  Distro: Manjaro Linux 
Machine:
  Type: Laptop System: Micro-Star product: PE62 7RD v: REV:1.0 
  serial: <filter> Chassis: type: 10 serial: <filter> 
  Mobo: Micro-Star model: MS-16J9 v: REV:1.0 serial: <filter> 
  UEFI: American Megatrends v: E16J9IMS.324 date: 03/23/2018 
Battery:
  ID-1: BAT1 charge: 38.1 Wh condition: 38.1/42.4 Wh (90%) 
  volts: 12.2/10.8 model: MSI BIF0_9 type: Li-ion serial: N/A 
  status: Full 
CPU:
  Topology: Quad Core model: Intel Core i7-7700HQ bits: 64 type: MT MCP 
  arch: Kaby Lake rev: 9 L2 cache: 6144 KiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 44817 
  Speed: 1200 MHz min/max: 800/3800 MHz Core speeds (MHz): 1: 1200 
  2: 1200 3: 1200 4: 1200 5: 1201 6: 1200 7: 1200 8: 1200 
Graphics:
  Device-1: Intel HD Graphics 630 vendor: Micro-Star MSI driver: i915 
  v: kernel bus ID: 00:02.0 chip ID: 8086:591b 
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Mobile] 
  vendor: Micro-Star MSI driver: nvidia v: 440.59 bus ID: 01:00.0 
  chip ID: 10de:1c8d 
  Display: x11 server: X.Org 1.20.7 driver: modesetting,nvidia 
  alternate: fbdev,intel,nouveau,nv,vesa compositor: kwin_x11 
  resolution: 1920x1080~60Hz 
  OpenGL: renderer: Mesa DRI Intel HD Graphics 630 (Kaby Lake GT2) 
  v: 4.6 Mesa 19.3.4 compat-v: 3.0 direct render: Yes 
Audio:
  Device-1: Intel CM238 HD Audio vendor: Micro-Star MSI 
  driver: snd_hda_intel v: kernel bus ID: 00:1f.3 chip ID: 8086:a171 
  Sound Server: ALSA v: k5.5.6-1-MANJARO 
Network:
  Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak] 
  driver: iwlwifi v: kernel port: e000 bus ID: 02:00.0 chip ID: 8086:24fb 
  IF: wlp2s0 state: up mac: <filter> 
  Device-2: Qualcomm Atheros QCA8171 Gigabit Ethernet 
  vendor: Micro-Star MSI driver: alx v: kernel port: d000 bus ID: 03:00.0 
  chip ID: 1969:10a1 
  IF: enp3s0 state: down mac: <filter> 
Drives:
  Local Storage: total: 931.51 GiB used: 23.08 GiB (2.5%) 
  ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630 
  size: 931.51 GiB speed: 6.0 Gb/s rotation: 7200 rpm serial: <filter> 
  rev: A3U0 scheme: GPT 
Partition:
  ID-1: / size: 244.47 GiB used: 23.05 GiB (9.4%) fs: ext4 dev: /dev/sda6 
Sensors:
  System Temperatures: cpu: 46.0 C mobo: 27.8 C 
  Fan Speeds (RPM): N/A 
Info:
  Processes: 223 Uptime: 3m Memory: 15.56 GiB used: 968.4 MiB (6.1%) 
  Init: systemd v: 242 Compilers: gcc: 9.2.1 Shell: zsh v: 5.8 
  running in: konsole inxi: 3.0.37

nvidia-bug-report.log.gz (206 KB)

Can someone from nvidia confirm whether this is a bug in the nvidia drivers, my hardware, linux, or something else? @aplattner?

Also, I found https://patchwork.kernel.org/patch/11195507/ , which may be related. But I am not sure whether it affects the proprietary driver + prime. Also, according to the last comment on that patch, it looks like they did not find a proper way to fix it anyways :(

I’m sorry, I don’t know offhand whether this is a system bug or a driver bug. Based on the symptoms it sounds like a system-level problem but it’s hard to know for sure.

For now, I would recommend disabling dynamic power management by removing whatever rule is setting NVreg_DynamicPowerManagement=2 (presumably in /etc/modprobe.d/* somewhere).
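
To locate the rule mentioned above, something like the following sketch may help. The function name `find_dpm_rules` and the default directory list are assumptions made for illustration; a distro may keep the option somewhere else.

```shell
# List every line that mentions NVreg_DynamicPowerManagement, with file name
# and line number. With no arguments, search the usual modprobe config
# directories (an assumption; pass other directories to override).
find_dpm_rules() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/modprobe.d /usr/lib/modprobe.d
    fi
    for dir in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        grep -rHn 'NVreg_DynamicPowerManagement' "$dir" 2>/dev/null
    done
    return 0
}
```

Running `find_dpm_rules` with no arguments prints the matching lines, if any. Module options are read when the module loads, so an edit only takes effect the next time the nvidia module is loaded.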

For now I am using https://github.com/dglt1/optimus-switch-sddm to switch to intel only mode. Hopefully, this issue gets solved soon.

I tracked down someone who is familiar with this problem and unfortunately it does sound like a bug in the system firmware rather than something we can fix in the driver. If I understand correctly, the system is powering the GPU off before the driver has a chance to load and initialize it properly.

Please check to see if there is a BIOS update available for your system.

No updates available. AFAICT, msi stopped updates for this laptop.
https://www.msi.com/Laptop/support/PE62-7RD.html
BIOS last update 2018.
Firmware last update 2017.
And I have no idea whether there is a way to contact msi. Even if there were, I lack the technical knowledge needed to explain this problem to them.

So I am guessing, using linux with nvidia on this laptop is a no-go? @aplattner

hello?

I’m experiencing the same problem, however, it appears that my nvidia card isn’t showing up in lspci:

sudo lspci | grep -E "VGA|3D"
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)

Though I’m seeing similar dmesg info at boot:

[   34.191562] xhci_hcd 0000:3e:00.0: xHCI Host Controller
[   34.191644] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 4
[   34.191652] xhci_hcd 0000:3e:00.0: Host supports USB 3.1 Enhanced SuperSpeed
[   34.191693] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.05
[   34.191694] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[   34.191696] usb usb4: Product: xHCI Host Controller
[   34.191697] usb usb4: Manufacturer: Linux 5.5.10-200.fc31.x86_64 xhci-hcd
[   34.191698] usb usb4: SerialNumber: 0000:3e:00.0
[   34.191851] hub 4-0:1.0: USB hub found
[   34.191860] hub 4-0:1.0: 2 ports detected
[   34.363952] thunderbolt 0-0: ignoring unnecessary extra entries in DROM
[   75.347447] pcieport 0000:07:02.0: can't change power state from D3cold to D0 (config space inaccessible)
[   75.347465] xhci_hcd 0000:3e:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   75.347473] xhci_hcd 0000:3e:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   75.347487] xhci_hcd 0000:3e:00.0: Controller not ready at resume -19
[   75.347488] xhci_hcd 0000:3e:00.0: PCI post-resume error -19!
[   75.347489] xhci_hcd 0000:3e:00.0: HC died; cleaning up
[   75.347502] xhci_hcd 0000:3e:00.0: remove, state 4
[   75.347505] usb usb4: USB disconnect, device number 1
[   75.347736] xhci_hcd 0000:3e:00.0: USB bus 4 deregistered
[   75.347806] xhci_hcd 0000:3e:00.0: remove, state 4
[   75.347809] usb usb3: USB disconnect, device number 1
[   75.348004] xhci_hcd 0000:3e:00.0: Host halt failed, -19
[   75.348006] xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
[   75.348063] xhci_hcd 0000:3e:00.0: USB bus 3 deregistered
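
Since several different devices in this thread hit the same message, it can help to condense a log into one line per affected device. The sketch below is only an illustration: the name `pm_errors` is made up, and it assumes the log lines keep the `driver address: can't change power state from X to Y` shape seen in the excerpts here.

```shell
# Read kernel log lines on stdin and print "address driver FROM->TO" once for
# each distinct power state error found.
pm_errors() {
    awk '
    /change power state from/ {
        # Locate the word "change"; the driver name and PCI address are the
        # two fields just before the word that precedes it.
        for (i = 4; i + 6 <= NF; i++) {
            if ($i == "change" && $(i + 1) == "power") {
                addr = $(i - 2)
                sub(/:$/, "", addr)   # drop the colon after the address
                printf "%s %s %s->%s\n", addr, $(i - 3), $(i + 4), $(i + 6)
                break
            }
        }
    }' | sort -u
}
```

Typical use would be `sudo dmesg | pm_errors`, or `journalctl -k -b -1 | pm_errors` to inspect the previous boot.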

wagoodman.mlbx, this looks like some incorrect udev rules placed by your distro. Please run
grep 10de /lib/udev/rules.d/*
and post the output.

Looks like it can be fixed in the nvidia drivers https://bugzilla.kernel.org/show_bug.cgi?id=156341#c169

I tried nvidia 450.57 to test out the newer PM features on a GeForce GTX 1650, and on every reboot I got this error, which prevented me from actually using the laptop.

I am still testing this, but there is one workaround that let me use the laptop: add modprobe.blacklist=nvidia to the kernel command line before booting.

The theory behind the workaround is that the nvidia driver must be loaded later. Putting modprobe.blacklist=nvidia on the kernel command line does not actually prevent the nvidia driver from loading. It only prevents it from being loaded automatically, so it still gets loaded by the time X starts.

Something tells me that a race is going on somewhere, maybe because nvidia assumes that the GPU is completely powered off (D3cold) when it’s not actually off. After some time, Linux’s runtime PM powers off the GPU, which coincidentally happens before X starts. Then the nvidia driver loads, proceeds to power on the GPU (D0) and succeeds.

It’s not foolproof, though: there is still a chance that this error pops up and freezes the system on wake.
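
To confirm that the parameter actually reached the kernel, the running command line can be inspected. This is a minimal sketch; the function name `cmdline_blacklists` is invented here, and the optional second argument exists only so the check can be tried against a saved copy of a command line.

```shell
# Succeed if MODULE appears in a modprobe.blacklist= parameter of the given
# kernel command line file (default: /proc/cmdline).
cmdline_blacklists() {
    module=$1
    cmdline_file=${2:-/proc/cmdline}
    # One parameter per line, keep only the blacklist value, split it on
    # commas, then look for an exact, literal match of the module name.
    tr ' ' '\n' < "$cmdline_file" |
        sed -n 's/^modprobe\.blacklist=//p' |
        tr ',' '\n' |
        grep -qxF -- "$module"
}
```

For example, `cmdline_blacklists nvidia && echo "nvidia will not be auto-loaded"`. Once X is up, `lsmod | grep '^nvidia'` shows whether the driver was loaded later anyway, as described above.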

I agree that there is a race somewhere because I have the problem maybe 80% of the time: I need to boot my laptop 5 times so that it finally initializes everything. When it gets stuck, I have the following in dmesg:

[   15.796957] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   15.796998] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   15.797882] iwlwifi 0000:3b:00.0: HW_REV=0xFFFFFFFF, PCI issues?
[   15.888498] iwlwifi: probe of 0000:3b:00.0 failed with error -5

And these are the only error messages I have in dmesg. Even then, my laptop boots up to the graphical interface and everything, but none of the network cards can be found if the initialization failed.

When the initialization succeeds, dmesg contains (among other things, of course):

[   12.288394] iwlwifi 0000:3b:00.0: enabling device (0000 -> 0002)
[   12.322697] iwlwifi 0000:3b:00.0: firmware: direct-loading firmware iwlwifi-9260-th-b0-jf-b0-46.ucode

Excerpt of lspci (after a successful boot):

01:00.0 3D controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
3b:00.0 Network controller: Intel Corporation Wireless-AC 9260 (rev 29)

I’m using a Debian testing system.

I am somewhat doubtful that what seems to be a wifi initialization issue is actually the same problem, but I do have an NVIDIA card, and google does not show many pages with the exact same error message that I see…

I’ve completed my tests (quite a while back) regarding this issue and I seem to observe a pretty simple pattern.

PCI PM state changes don’t seem to take effect immediately after they are requested, meaning a transition from D0 to D3cold or vice versa may take a few seconds.

During that time, any further state change request results in “can’t change power state from D3cold to D0”. It could be that the PCI device cannot stop midway through the transition.

In my system (a laptop), I have two ways to trigger this error:

The first way is to trigger a transition by writing ‘auto’ or ‘on’ to the PCI device’s power/control file, whichever is not already set, and then try loading the module.
For example, on my system
cat /sys/bus/pci/devices/0000:01:00.0/power/control
prints on, so I run
echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
After that, I try to load the nvidia module and the system freezes with that error.
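
Before experimenting with the commands above, it may be useful to look at both the control setting and the current runtime PM status of the device. The helper below is a sketch with a made-up name, `pci_pm_state`; the default address 0000:01:00.0 comes from this thread and may differ on other machines.

```shell
# Print the runtime PM control setting and current status of one PCI device.
# The argument is the device's sysfs directory.
pci_pm_state() {
    dev_dir=${1:-/sys/bus/pci/devices/0000:01:00.0}
    for attr in control runtime_status; do
        if [ -r "$dev_dir/power/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev_dir/power/$attr")"
        else
            # The attribute is missing or unreadable on this system.
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

`control` is the value written with echo (on or auto), while `runtime_status` is what the kernel reports for the device at that moment.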

The second way uses the fact that a udev rule sets the PCI device’s PM control to “on” when the module is unloaded. You just have to unload and then reload the module.
The udev rule is from the Automated Setup section of PCI-Express Runtime D3 (RTD3) Power Management.
For example, after making sure that the nvidia module is loaded, I wait about 2 minutes and then run
modprobe -r nvidia; modprobe nvidia
This also freezes the system with that error.

So my conclusion? I think the foolproof way here is to never set the PCI device’s runtime PM control to ‘auto’ during boot, and to set it only after the system has initialized. Alternatively, load the module manually after a successful boot, or otherwise delay loading it for as long as possible.
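
One way to sketch the "delay loading" idea is to wait until the device no longer reports a transition in progress before loading the module. The function name `wait_pm_settled` is made up, and treating sysfs `runtime_status` as proof that the hardware transition has finished is an assumption that nothing in this thread confirms, so this is an experiment rather than a fix.

```shell
# Wait until a PCI device's runtime PM status is no longer in transition.
# Returns 0 once runtime_status is readable and is neither "suspending" nor
# "resuming"; returns 1 if that has not happened after TIMEOUT seconds.
wait_pm_settled() {
    dev_dir=$1
    timeout=${2:-10}
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        pm_status=$(cat "$dev_dir/power/runtime_status" 2>/dev/null)
        case $pm_status in
            suspending|resuming|'') ;;   # in transition or unreadable: keep waiting
            *) return 0 ;;               # e.g. active or suspended
        esac
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A manual load after boot might then look like `wait_pm_settled /sys/bus/pci/devices/0000:01:00.0 30 && sudo modprobe nvidia`, with the address adjusted to the GPU in question.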

Installing optimus-manager and forcing the use of the nvidia card seems to have sorted the issue for me on Manjaro XFCE.