Originally asked here - https://forum.manjaro.org/t/optimus-random-freezes-at-boot-cannot-change-nvidia-gfx-state-from-d3-to-d0/127089 where I was told about this being a possible nvidia driver bug.
Basically, I recently installed the latest Manjaro, including the latest NVIDIA hybrid driver and PRIME (I have NVIDIA Optimus).
Mostly, the setup works fine. But roughly once every 5-6 boots, the boot process randomly hangs after “TLP system startup/shutdown”. There are no errors or warnings in dmesg or journalctl, just a freeze. Rebooting by pressing the power button usually makes it boot on the next try.
It is a fresh linux install, unmodified, so I have not configured anything on it.
I tried three different Linux kernels: 5.4 LTS, 5.5 and 5.6-rc3. I can reproduce the issue on all three, but 5.6-rc3 is the only one that displays an error:
nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Host: manjaro Kernel: 5.5.6-1-MANJARO x86_64 bits: 64 compiler: gcc
v: 9.2.1 Desktop: KDE Plasma 5.17.5 tk: Qt 5.14.1 wm: kwin_x11 dm: SDDM
Distro: Manjaro Linux
Type: Laptop System: Micro-Star product: PE62 7RD v: REV:1.0
serial: <filter> Chassis: type: 10 serial: <filter>
Mobo: Micro-Star model: MS-16J9 v: REV:1.0 serial: <filter>
UEFI: American Megatrends v: E16J9IMS.324 date: 03/23/2018
ID-1: BAT1 charge: 38.1 Wh condition: 38.1/42.4 Wh (90%)
volts: 12.2/10.8 model: MSI BIF0_9 type: Li-ion serial: N/A
Topology: Quad Core model: Intel Core i7-7700HQ bits: 64 type: MT MCP
arch: Kaby Lake rev: 9 L2 cache: 6144 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Speed: 1200 MHz min/max: 800/3800 MHz Core speeds (MHz): 1: 1200
2: 1200 3: 1200 4: 1200 5: 1201 6: 1200 7: 1200 8: 1200
Device-1: Intel HD Graphics 630 vendor: Micro-Star MSI driver: i915
v: kernel bus ID: 00:02.0 chip ID: 8086:591b
Device-2: NVIDIA GP107M [GeForce GTX 1050 Mobile]
vendor: Micro-Star MSI driver: nvidia v: 440.59 bus ID: 01:00.0
chip ID: 10de:1c8d
Display: x11 server: X.Org 1.20.7 driver: modesetting,nvidia
alternate: fbdev,intel,nouveau,nv,vesa compositor: kwin_x11
OpenGL: renderer: Mesa DRI Intel HD Graphics 630 (Kaby Lake GT2)
v: 4.6 Mesa 19.3.4 compat-v: 3.0 direct render: Yes
Device-1: Intel CM238 HD Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel bus ID: 00:1f.3 chip ID: 8086:a171
Sound Server: ALSA v: k5.5.6-1-MANJARO
Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak]
driver: iwlwifi v: kernel port: e000 bus ID: 02:00.0 chip ID: 8086:24fb
IF: wlp2s0 state: up mac: <filter>
Device-2: Qualcomm Atheros QCA8171 Gigabit Ethernet
vendor: Micro-Star MSI driver: alx v: kernel port: d000 bus ID: 03:00.0
chip ID: 1969:10a1
IF: enp3s0 state: down mac: <filter>
Local Storage: total: 931.51 GiB used: 23.08 GiB (2.5%)
ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630
size: 931.51 GiB speed: 6.0 Gb/s rotation: 7200 rpm serial: <filter>
rev: A3U0 scheme: GPT
ID-1: / size: 244.47 GiB used: 23.05 GiB (9.4%) fs: ext4 dev: /dev/sda6
System Temperatures: cpu: 46.0 C mobo: 27.8 C
Fan Speeds (RPM): N/A
Processes: 223 Uptime: 3m Memory: 15.56 GiB used: 968.4 MiB (6.1%)
Init: systemd v: 242 Compilers: gcc: 9.2.1 Shell: zsh v: 5.8
running in: konsole inxi: 3.0.37
nvidia-bug-report.log.gz (206 KB)
Can someone from NVIDIA confirm whether this is a bug in the NVIDIA drivers, my hardware, Linux, or something else? @aplattner
Also, I found the patch “[v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges” on Patchwork, which may be related. But I am not sure whether it affects the proprietary driver + PRIME. Also, according to the last comment on that patch, it looks like they did not find a proper way to fix it anyway :(
I’m sorry, I don’t know offhand whether this is a system bug or a driver bug. Based on the symptoms it sounds like a system-level problem but it’s hard to know for sure.
For now, I would recommend disabling dynamic power management by removing whatever rule is setting NVreg_DynamicPowerManagement=2 (presumably in /etc/modprobe.d/* somewhere).
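To track down which file sets that option, something like the following may help. It is only a sketch: the function name `find_dynamic_pm_setting` and the default search directories are assumptions, and your distro may keep its modprobe configuration somewhere else.

```shell
# List every modprobe configuration line that mentions
# NVreg_DynamicPowerManagement, with file name and line number.
# Arguments: directories to search. The defaults below are an assumption
# about where distros usually keep these files.
find_dynamic_pm_setting() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/modprobe.d /usr/lib/modprobe.d
    fi
    for dir in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        # -r: recurse, -n: print line numbers, -H: always print the file name
        grep -rnH 'NVreg_DynamicPowerManagement' "$dir" 2>/dev/null
    done
    return 0
}
```

Calling `find_dynamic_pm_setting` with no arguments searches the default directories; any line it prints is a candidate for removal or for changing the value.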
I tracked down someone who is familiar with this problem and unfortunately it does sound like a bug in the system firmware rather than something we can fix in the driver. If I understand correctly, the system is powering the GPU off before the driver has a chance to load and initialize it properly.
Please check to see if there is a BIOS update available for your system.
No updates available. AFAICT, msi stopped updates for this laptop.
BIOS last update 2018.
Firmware last update 2017.
And I have no idea if there is a way to contact MSI. Even if there were, I lack the technical knowledge required to explain this problem to them.
So I am guessing, using linux with nvidia on this laptop is a no-go? @aplattner
I’m experiencing the same problem, however, it appears that my nvidia card isn’t showing up in lspci:
sudo lspci |grep -E "VGA|3D"
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
Though I’m seeing similar dmesg info at boot:
[ 34.191562] xhci_hcd 0000:3e:00.0: xHCI Host Controller
[ 34.191644] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 4
[ 34.191652] xhci_hcd 0000:3e:00.0: Host supports USB 3.1 Enhanced SuperSpeed
[ 34.191693] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.05
[ 34.191694] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 34.191696] usb usb4: Product: xHCI Host Controller
[ 34.191697] usb usb4: Manufacturer: Linux 5.5.10-200.fc31.x86_64 xhci-hcd
[ 34.191698] usb usb4: SerialNumber: 0000:3e:00.0
[ 34.191851] hub 4-0:1.0: USB hub found
[ 34.191860] hub 4-0:1.0: 2 ports detected
[ 34.363952] thunderbolt 0-0: ignoring unnecessary extra entries in DROM
[ 75.347447] pcieport 0000:07:02.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 75.347465] xhci_hcd 0000:3e:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 75.347473] xhci_hcd 0000:3e:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 75.347487] xhci_hcd 0000:3e:00.0: Controller not ready at resume -19
[ 75.347488] xhci_hcd 0000:3e:00.0: PCI post-resume error -19!
[ 75.347489] xhci_hcd 0000:3e:00.0: HC died; cleaning up
[ 75.347502] xhci_hcd 0000:3e:00.0: remove, state 4
[ 75.347505] usb usb4: USB disconnect, device number 1
[ 75.347736] xhci_hcd 0000:3e:00.0: USB bus 4 deregistered
[ 75.347806] xhci_hcd 0000:3e:00.0: remove, state 4
[ 75.347809] usb usb3: USB disconnect, device number 1
[ 75.348004] xhci_hcd 0000:3e:00.0: Host halt failed, -19
[ 75.348006] xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
[ 75.348063] xhci_hcd 0000:3e:00.0: USB bus 3 deregistered
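For long logs like this one, it can be easier to pull out just the failed power-state transitions. The helper below is a sketch; the name `power_state_failures` is made up for illustration, and the pattern only covers the exact message wording seen in this thread.

```shell
# Read kernel log text on stdin and print "<device> <from> <to>" for each
# failed PCI power-state transition. Lines without the message are dropped.
power_state_failures() {
    sed -n "s/.* \([^ ]*\): can't change power state from \([^ ]*\) to \([^ ]*\) (config space inaccessible).*/\1 \2 \3/p"
}
```

For example, `dmesg | power_state_failures | sort | uniq -c` would count how often each device failed each transition.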
wagoodman.mlbx, this looks like some incorrect udev rules placed by your distro. Please run
grep 10de /lib/udev/rules.d/*
and post the output.
I tried nvidia 450.57 to test out the newer PM features on GeForce GTX 1650, and every reboot I got this error preventing me from actually using the laptop.
I am still testing this out, but there is one workaround that allowed me to use the laptop: add “modprobe.blacklist=nvidia” to the kernel command line before booting.
The theory behind the workaround is that the nvidia driver must be loaded later. Putting “modprobe.blacklist=nvidia” on the kernel command line does not actually disable or prevent loading the nvidia driver; it just prevents it from being loaded automatically, and it still gets loaded by the time X starts.
Something tells me that a race is going on somewhere, maybe because nvidia assumes that the GPU is completely powered off (D3cold) when it’s not actually off, but after some time Linux’s runtime PM powers off the GPU which coincidentally happens before the time X starts. Then the nvidia driver loads, proceeds to power on (D0) the GPU and succeeds.
It’s not foolproof, though: there is still a chance that this error pops up and freezes the system on wake.
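To confirm that the parameter actually made it onto the running kernel’s command line, a check along these lines could be used. This is a sketch; `is_blacklisted_on_cmdline` is a hypothetical helper name, and it only looks at the `modprobe.blacklist=` parameter, not at files under /etc/modprobe.d.

```shell
# Succeed if the given module is listed in a modprobe.blacklist= parameter.
# $1: module name
# $2: file holding the kernel command line (default: /proc/cmdline)
is_blacklisted_on_cmdline() {
    module=$1
    cmdline_file=${2:-/proc/cmdline}
    # Put each kernel parameter on its own line, then look for the module
    # name as one element of the comma-separated modprobe.blacklist value.
    tr -s '[:space:]' '\n' < "$cmdline_file" |
        grep -qE "^modprobe\.blacklist=(.*,)?${module}(,.*)?\$"
}
```

Running `is_blacklisted_on_cmdline nvidia && echo "blacklisted at boot"` after a reboot shows whether the bootloader entry was edited correctly.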
I agree that there is a race somewhere because I have the problem maybe 80% of the time: I need to boot my laptop 5 times so that it finally initializes everything. When it gets stuck, I have the following in dmesg:
[ 15.796957] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 15.796998] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 15.797882] iwlwifi 0000:3b:00.0: HW_REV=0xFFFFFFFF, PCI issues?
[ 15.888498] iwlwifi: probe of 0000:3b:00.0 failed with error -5
And these are the only error messages I have in dmesg. Even then, my laptop boots up to the graphical interface and everything, but none of the network cards can be found if the initialization failed.
When the initialization succeeds, dmesg contains (among other things, of course):
[ 12.288394] iwlwifi 0000:3b:00.0: enabling device (0000 -> 0002)
[ 12.322697] iwlwifi 0000:3b:00.0: firmware: direct-loading firmware iwlwifi-9260-th-b0-jf-b0-46.ucode
Excerpt of lspci (after a successful boot):
01:00.0 3D controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
3b:00.0 Network controller: Intel Corporation Wireless-AC 9260 (rev 29)
I’m using a Debian testing system.
I am somewhat doubtful that what seems to be a wifi initialization issue is actually the same problem, but I do have an NVIDIA card, and Google does not show many pages with the exact same error message that I see…
I’ve completed my tests (quite a while back) regarding this issue and I seem to observe a pretty simple pattern.
PCI PM state changes don’t seem to occur instantly after they are requested, meaning a state change from D0 → D3cold or vice versa may take a few seconds.
During that time, any state change requested will result in “can’t change power state from D3cold to D0”. This could be because the PCI device cannot stop midway through the transition.
In my system (a laptop), I have two ways to trigger this error:
The first way is to trigger a transition by writing ‘auto’ or ‘on’ to the PCI device’s PM control, whichever is not already set, and then try loading the module.
For example, on my system the result of `cat /sys/bus/pci/devices/0000:01:00.0/power/control` is `on`, so I run `echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control`. After that, I try to load the nvidia module, and the system freezes with that error.
The second way relies on the fact that a udev rule sets the PCI device’s PM control to “on” when the module is unloaded. You just have to unload and then reload the module.
The udev rule comes from the “Automated Setup” section of the NVIDIA driver README chapter “PCI-Express Runtime D3 (RTD3) Power Management”.
For example, after making sure that the nvidia module is loaded, I wait for about 2 minutes and then run `modprobe -r nvidia; modprobe nvidia`. This also freezes the system with that error.
So my conclusion? I think the foolproof way here is to never set the PCI device’s runtime PM control to ‘auto’ during boot, and to set it only after the system has initialized. Alternatively, load the module manually after a successful boot, or otherwise delay loading it as long as possible.
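For anyone who wants to look at the device state before loading the module by hand, here is a rough sketch. The helper names `pci_pm_state` and `pci_pm_settled` are invented for illustration. It is also an assumption that the kernel’s `runtime_status` attribute (which can read `suspending` or `resuming` during a transition) covers the window in which the error occurs; nothing in this thread confirms that.

```shell
# Print the runtime-PM control setting and status of a PCI device.
# $1: PCI address such as 0000:01:00.0
# $2: sysfs device directory (default: /sys/bus/pci/devices)
pci_pm_state() {
    dev_dir="${2:-/sys/bus/pci/devices}/$1/power"
    # Fail if the device (or its power/control attribute) does not exist.
    control=$(cat "$dev_dir/control" 2>/dev/null) || return 1
    status=$(cat "$dev_dir/runtime_status" 2>/dev/null) || status=unknown
    printf 'control=%s runtime_status=%s\n' "$control" "$status"
}

# Succeed only if the device exists and is not reported as mid-transition.
pci_pm_settled() {
    case "$(pci_pm_state "$1" "$2")" in
        *runtime_status=suspending|*runtime_status=resuming|'') return 1 ;;
        *) return 0 ;;
    esac
}
```

One possible use is `pci_pm_settled 0000:01:00.0 && modprobe nvidia`, run manually after the desktop is up, with the address replaced by your own GPU’s.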
Installing optimus-manager and forcing the use of the nvidia card seems to have sorted the issue for me on Manjaro XFCE.
Is this the right area for reporting this issue?
Back in September, I was still using the 450.57 driver with the 5.8 Linux kernel on Ubuntu 20.04.
On the 5.10 Linux kernel with driver version 460.39, the workaround I mentioned does not work anymore.
Additionally, the “NVreg_DynamicPowerManagement=0x02” option is now the only way to suspend the NVIDIA GPU when not in use. The “NVreg_DynamicPowerManagement=0x01” will suspend it only if there are no applications running on it. But on 460.39, Xorg creates something like a persistent glxserver for NVIDIA. That counts as an application so the driver will never put it to suspend even if it is not being used.
For the driver development team: I think I have stumbled upon easier reproduction steps that trigger this bug while the system is still running (making data collection possible):
- Reboot with nvidia, nvidia_drm and nvidia_modeset blacklisted. Make sure that these modules are not loaded but still can be loaded manually.
- Make sure that the
- Make sure that
nvidia-bug-report.sh which should eventually load the nvidia kernel module.
- You get a Killed message for nvidia-smi or nothing for nvidia-bug-report.sh. Nevertheless, the bug should have been triggered, the GPU will not be usable, and the system is on the brink of crashing.
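The precondition in the first step (none of the three modules loaded) can be checked with something like the sketch below. The function name `loaded_modules` is made up, and it relies on the first field of each /proc/modules line being the module name.

```shell
# Print those of the given module names that are currently loaded.
# $1: file in /proc/modules format; remaining arguments: module names.
loaded_modules() {
    modules_file=$1
    shift
    for name in "$@"; do
        # The first whitespace-separated field of each line is the module name.
        if awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"; then
            printf '%s\n' "$name"
        fi
    done
}
```

If `loaded_modules /proc/modules nvidia nvidia_drm nvidia_modeset` prints nothing after the reboot, none of the three modules was loaded automatically.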
I attached the output and the dmesg logs for two cases:
Just running nvidia-bug-report.sh (I also captured the dmesg log after it)
nvidia-bug-report.tar.gz (694.3 KB)
Running nvidia-smi then nvidia-bug-report.sh. The nvidia-bug-report.sh hangs.
nvidia-smi.tar.gz (742.9 KB)
If more information is needed, please reply and I will try my best to provide it.
I am having the same issue on my workstation with 2x V100 GPUs.
I can use NVIDIA driver 450.119.03 with no issues, but I get the following error during startup when I upgrade to NVIDIA driver 460.73.01. Startup then stops at the following message and the graphical login is not shown.
`[10.404497] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)`
I originally got this error with Ubuntu 18.04 (with kernels 5.4.0-54.60 and 5.4.0-73.82). To try to solve it, I upgraded to Ubuntu 20.04 (kernel 5.8.0-53.60) and also upgraded my workstation BIOS, but it did not help. I have also tried both blacklisting and not blacklisting the nouveau driver. Following @bugmenot.oss’s instructions, I ran `cat /sys/bus/pci/devices/0000:01:00.0/power/control` (my GPU devices seem to be at 0000:2d:00) and observed that the config was auto. I could not edit the config, though, because I had no write permission to /sys, despite mounting it as read-only.
Had to return to driver 450 to get my Ubuntu graphics back. But I do need the 460 driver for development with CUDA 11.2.
Here are the nvidia-bug-report logs with NVIDIA driver 450 and with 460.
nvidia-bug-report_with_nvidia-driver-450.log.gz (727.7 KB)
nvidia-bug-report_with_nvidia-driver-460.log.gz (562.7 KB)
Having the same issue with a GTX 1650 on Manjaro Linux 21.2.0 running kernel 5.14 and NVIDIA 495.44 (hybrid graphics).
It seems to happen more frequently when rebooting from Windows, even with Fast Boot toggled off.
Here’s nvidia-bug-report on a correct boot (I cannot run it when the system crashes because it’s a full blown system hang)
nvidia-bug-report.log.gz (263.2 KB)
Surprisingly, I didn’t have this issue on Ubuntu, although I have since deleted that install and don’t remember which drivers were running there.
I have also noticed that disabling the NVIDIA card in the BIOS will stop the bug from happening.
I also encountered this problem.
I installed the nvidia driver for my optimus laptop and then my system wouldn’t boot consistently. 1 out of 5 times it would boot normally, 1/5 it would boot with Nvidia Daemon Failing and 3/5 it would not boot and freeze before login.
This is the “race” reported above. Also, when it booted with the NVIDIA daemon failing, dmesg showed the bug in the title.
All I had to do was to blacklist the nvidia driver in /etc/modprobe.d/bumblebee.conf. This way the module is loaded after boot, when bumblebee is called.
This is explicitly mentioned in the configuration file which is created during bumblebee installation:
# do not automatically load nvidia as it’s unloaded anyway when bumblebeed
# starts and may fail bumblebeed to disable the card in a race condition.
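To check whether such a blacklist entry is actually present, something like this could be used. It is a sketch: `is_blacklisted_in_modprobe_d` is a hypothetical helper name, and it only matches the name exactly as written (modprobe itself treats `-` and `_` in module names as interchangeable, which this does not handle).

```shell
# Succeed if an uncommented "blacklist <module>" line exists in any *.conf
# file of the given directory.
# $1: module name
# $2: directory to search (default: /etc/modprobe.d)
is_blacklisted_in_modprobe_d() {
    conf_dir=${2:-/etc/modprobe.d}
    # The pattern is anchored at the start of the line, so lines beginning
    # with "#" do not match. -q: quiet, -s: ignore missing files.
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+$1[[:space:]]*\$" "$conf_dir"/*.conf
}
```

For instance, `is_blacklisted_in_modprobe_d nvidia && echo "nvidia will not be auto-loaded"` reports whether the entry is in place.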
Hope this helps someone.
Asus N551JW Laptop (Optimus)
Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M]
Driver Version: 460.91.03