Originally asked here - https://forum.manjaro.org/t/optimus-random-freezes-at-boot-cannot-change-nvidia-gfx-state-from-d3-to-d0/127089 where I was told about this being a possible nvidia driver bug.
Basically, I recently installed manjaro latest including latest nvidia hybrid driver, and prime (I have nvidia optimus).
Mostly, the setup works fine. But every 5-6 boots, the boot process randomly hangs after "tlp system startup shutdown". There are no errors or warnings in dmesg or journalctl, just a freeze. Rebooting by pressing the power button usually makes it boot on the next try.
It is a fresh linux install, unmodified, so I have not configured anything on it.
I tried 3 different linux kernels. 5.4 lts, 5.5 & 5.6 rc3. I can reproduce the issue on all 3 kernels. But 5.6rc3 is the only one which displays an error, which says -
nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Additional information:
inxi -Fxxxz
System:
Host: manjaro Kernel: 5.5.6-1-MANJARO x86_64 bits: 64 compiler: gcc
v: 9.2.1 Desktop: KDE Plasma 5.17.5 tk: Qt 5.14.1 wm: kwin_x11 dm: SDDM
Distro: Manjaro Linux
Machine:
Type: Laptop System: Micro-Star product: PE62 7RD v: REV:1.0
serial: <filter> Chassis: type: 10 serial: <filter>
Mobo: Micro-Star model: MS-16J9 v: REV:1.0 serial: <filter>
UEFI: American Megatrends v: E16J9IMS.324 date: 03/23/2018
Battery:
ID-1: BAT1 charge: 38.1 Wh condition: 38.1/42.4 Wh (90%)
volts: 12.2/10.8 model: MSI BIF0_9 type: Li-ion serial: N/A
status: Full
CPU:
Topology: Quad Core model: Intel Core i7-7700HQ bits: 64 type: MT MCP
arch: Kaby Lake rev: 9 L2 cache: 6144 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
bogomips: 44817
Speed: 1200 MHz min/max: 800/3800 MHz Core speeds (MHz): 1: 1200
2: 1200 3: 1200 4: 1200 5: 1201 6: 1200 7: 1200 8: 1200
Graphics:
Device-1: Intel HD Graphics 630 vendor: Micro-Star MSI driver: i915
v: kernel bus ID: 00:02.0 chip ID: 8086:591b
Device-2: NVIDIA GP107M [GeForce GTX 1050 Mobile]
vendor: Micro-Star MSI driver: nvidia v: 440.59 bus ID: 01:00.0
chip ID: 10de:1c8d
Display: x11 server: X.Org 1.20.7 driver: modesetting,nvidia
alternate: fbdev,intel,nouveau,nv,vesa compositor: kwin_x11
resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa DRI Intel HD Graphics 630 (Kaby Lake GT2)
v: 4.6 Mesa 19.3.4 compat-v: 3.0 direct render: Yes
Audio:
Device-1: Intel CM238 HD Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel bus ID: 00:1f.3 chip ID: 8086:a171
Sound Server: ALSA v: k5.5.6-1-MANJARO
Network:
Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak]
driver: iwlwifi v: kernel port: e000 bus ID: 02:00.0 chip ID: 8086:24fb
IF: wlp2s0 state: up mac: <filter>
Device-2: Qualcomm Atheros QCA8171 Gigabit Ethernet
vendor: Micro-Star MSI driver: alx v: kernel port: d000 bus ID: 03:00.0
chip ID: 1969:10a1
IF: enp3s0 state: down mac: <filter>
Drives:
Local Storage: total: 931.51 GiB used: 23.08 GiB (2.5%)
ID-1: /dev/sda vendor: HGST (Hitachi) model: HTS721010A9E630
size: 931.51 GiB speed: 6.0 Gb/s rotation: 7200 rpm serial: <filter>
rev: A3U0 scheme: GPT
Partition:
ID-1: / size: 244.47 GiB used: 23.05 GiB (9.4%) fs: ext4 dev: /dev/sda6
Sensors:
System Temperatures: cpu: 46.0 C mobo: 27.8 C
Fan Speeds (RPM): N/A
Info:
Processes: 223 Uptime: 3m Memory: 15.56 GiB used: 968.4 MiB (6.1%)
Init: systemd v: 242 Compilers: gcc: 9.2.1 Shell: zsh v: 5.8
running in: konsole inxi: 3.0.37
nvidia-bug-report.log.gz (206 KB)
Can someone from nvidia confirm whether this is a bug with the nvidia drivers or my hardware or linux or something else. @aplattner?
Also, I found "[v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges" on Patchwork, which may be related. But I am not sure whether it affects the proprietary driver + prime. Also, according to the last comment on that patch, it looks like they did not find a proper way to fix it anyways :(
I'm sorry, I don't know offhand whether this is a system bug or a driver bug. Based on the symptoms it sounds like a system-level problem but it's hard to know for sure.
For now, I would recommend disabling dynamic power management by removing whatever rule is setting NVreg_DynamicPowerManagement=2 (presumably in /etc/modprobe.d/* somewhere).
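To locate the rule in question, a minimal sketch along these lines may help. The function name find_dpm_rules is hypothetical, and the default directory is an assumption: distributions may also ship modprobe configuration under /usr/lib/modprobe.d or /lib/modprobe.d, so those may need checking too.

```shell
# Hypothetical helper: list modprobe config lines that mention
# NVreg_DynamicPowerManagement, with file name and line number.
# $1: directory to search (assumed default: /etc/modprobe.d)
find_dpm_rules() {
    dir="${1:-/etc/modprobe.d}"
    grep -rn 'NVreg_DynamicPowerManagement' "$dir" 2>/dev/null
}
```

Calling find_dpm_rules with no argument searches /etc/modprobe.d; pass another directory as the first argument to search elsewhere.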
I tracked down someone who is familiar with this problem and unfortunately it does sound like a bug in the system firmware rather than something we can fix in the driver. If I understand correctly, the system is powering the GPU off before the driver has a chance to load and initialize it properly.
Please check to see if there is a BIOS update available for your system.
No updates available. AFAICT, msi stopped updates for this laptop.
https://www.msi.com/Laptop/support/PE62-7RD.html
BIOS last update 2018.
Firmware last update 2017.
And I have no idea if there is a way to contact msi. Even if there was a way to contact them, I lack the technical knowledge required to explain this problem to them.
So I am guessing, using linux with nvidia on this laptop is a no-go? @aplattner
I'm experiencing the same problem, however, it appears that my nvidia card isn't showing up in lspci:
sudo lspci |grep -E "VGA|3D"
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
Though I'm seeing similar dmesg info at boot:
[ 34.191562] xhci_hcd 0000:3e:00.0: xHCI Host Controller
[ 34.191644] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 4
[ 34.191652] xhci_hcd 0000:3e:00.0: Host supports USB 3.1 Enhanced SuperSpeed
[ 34.191693] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.05
[ 34.191694] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 34.191696] usb usb4: Product: xHCI Host Controller
[ 34.191697] usb usb4: Manufacturer: Linux 5.5.10-200.fc31.x86_64 xhci-hcd
[ 34.191698] usb usb4: SerialNumber: 0000:3e:00.0
[ 34.191851] hub 4-0:1.0: USB hub found
[ 34.191860] hub 4-0:1.0: 2 ports detected
[ 34.363952] thunderbolt 0-0: ignoring unnecessary extra entries in DROM
[ 75.347447] pcieport 0000:07:02.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 75.347465] xhci_hcd 0000:3e:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 75.347473] xhci_hcd 0000:3e:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 75.347487] xhci_hcd 0000:3e:00.0: Controller not ready at resume -19
[ 75.347488] xhci_hcd 0000:3e:00.0: PCI post-resume error -19!
[ 75.347489] xhci_hcd 0000:3e:00.0: HC died; cleaning up
[ 75.347502] xhci_hcd 0000:3e:00.0: remove, state 4
[ 75.347505] usb usb4: USB disconnect, device number 1
[ 75.347736] xhci_hcd 0000:3e:00.0: USB bus 4 deregistered
[ 75.347806] xhci_hcd 0000:3e:00.0: remove, state 4
[ 75.347809] usb usb3: USB disconnect, device number 1
[ 75.348004] xhci_hcd 0000:3e:00.0: Host halt failed, -19
[ 75.348006] xhci_hcd 0000:3e:00.0: Host not accessible, reset failed.
[ 75.348063] xhci_hcd 0000:3e:00.0: USB bus 3 deregistered
wagoodman.mlbx, this looks like some incorrect udev rules placed by your distro. Please run
grep 10de /lib/udev/rules.d/*
and post the output.
I tried nvidia 450.57 to test out the newer PM features on GeForce GTX 1650, and every reboot I got this error preventing me from actually using the laptop.
I am still testing this out, but there's one workaround that allowed me to use the laptop: add "modprobe.blacklist=nvidia" to the kernel command line before it boots.
The theory behind the workaround is that the nvidia driver must be loaded later, and that "modprobe.blacklist=nvidia" on the kernel command line does not actually disable or prevent loading the nvidia driver. That option just prevents it from being loaded automatically; it still gets loaded by the time X starts.
Something tells me that a race is going on somewhere, maybe because nvidia assumes that the GPU is completely powered off (D3cold) when it's not actually off, but after some time Linux's runtime PM powers off the GPU, which coincidentally happens before X starts. Then the nvidia driver loads, proceeds to power on (D0) the GPU and succeeds.
It's not foolproof though: there is still a chance that this error pops up and freezes the system on wake.
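To confirm that the option actually reached the kernel, something like the sketch below could be used. The function name is hypothetical, and it assumes the option appears as a comma-separated modprobe.blacklist= list on the command line exposed in /proc/cmdline.

```shell
# Hypothetical helper: succeed if a module is listed in a
# modprobe.blacklist= option on the kernel command line.
# $1: module name, e.g. nvidia
# $2: command line file (assumed default: /proc/cmdline)
is_blacklisted_on_cmdline() {
    module="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split into one option per line, keep the blacklist option(s),
    # split the comma-separated value, then match the name exactly.
    tr ' ' '\n' < "$cmdline_file" \
        | grep '^modprobe\.blacklist=' \
        | cut -d= -f2 \
        | tr ',' '\n' \
        | grep -qxF "$module"
}
```

For example, is_blacklisted_on_cmdline nvidia && echo "nvidia is blacklisted at boot" reports whether the current boot used the workaround. It does not tell you whether the module was loaded later; lsmod shows that.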
I agree that there is a race somewhere because I have the problem maybe 80% of the time: I need to boot my laptop 5 times so that it finally initializes everything. When it gets stuck, I have the following in dmesg:
[ 15.796957] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 15.796998] iwlwifi 0000:3b:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 15.797882] iwlwifi 0000:3b:00.0: HW_REV=0xFFFFFFFF, PCI issues?
[ 15.888498] iwlwifi: probe of 0000:3b:00.0 failed with error -5
And these are the only error messages I have in dmesg. Even then, my laptop boots up to the graphical interface and everything, but none of the network cards can be found if the initialization failed.
When the initialization succeeds, dmesg contains (among other things, of course):
[ 12.288394] iwlwifi 0000:3b:00.0: enabling device (0000 -> 0002)
[ 12.322697] iwlwifi 0000:3b:00.0: firmware: direct-loading firmware iwlwifi-9260-th-b0-jf-b0-46.ucode
Excerpt of lspci (after a successful boot):
01:00.0 3D controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
3b:00.0 Network controller: Intel Corporation Wireless-AC 9260 (rev 29)
I'm using a Debian testing system.
I am somewhat doubtful that what seems to be a wifi initialization issue is actually the same problem, but I do have an NVIDIA card, and Google does not show many pages with the exact same error message that I see...
I've completed my tests (quite a while back) regarding this issue and I seem to observe a pretty simple pattern.
PCI PM state changes don't seem to occur instantly after they are requested, meaning a state change from D0 to D3cold or vice versa may take a few seconds.
During that time, any state change requested will result in "can't change power state from D3cold to D0". This could be because the PCI device cannot stop midway through the transition.
In my system (a laptop), I have two ways to trigger this error:
The first way is to trigger a transition by writing "auto" or "on" to the PCI device's PM control, depending on which is already set, and then try loading the module.
For example, on my system the result of
cat /sys/bus/pci/devices/0000:01:00.0/power/control
is "on", so I run
echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
After that, I try to load the nvidia module, and the system freezes with that error.
The second way uses the fact that a udev rule sets the PCI device's PM control to "on" when the module is unloaded. You just have to unload and then reload the module.
The udev rule is from the "Automated Setup" section of "PCI-Express Runtime D3 (RTD3) Power Management".
For example, after making sure that the nvidia module is loaded, I wait for about 2 minutes and then run
modprobe -r nvidia; modprobe nvidia
This also freezes the system with that error.
So my conclusion? I think the foolproof way here is to never set the PCI device's runtime PM control to "auto" during boot, and to set it only after the system has initialized. Alternatively, load the module manually after a successful boot, or otherwise delay loading it for as long as possible.
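To see which state the device is in before loading the module, here is a minimal sketch that prints the two runtime PM attributes discussed in this thread. The function name and the second parameter are hypothetical; the PCI address 0000:01:00.0 is the one from the post above and may differ on other systems.

```shell
# Hypothetical helper: print the runtime PM attributes of a PCI device.
# $1: PCI address, e.g. 0000:01:00.0
# $2: sysfs device root (assumed default: /sys/bus/pci/devices)
pci_pm_status() {
    addr="$1"
    root="${2:-/sys/bus/pci/devices}"
    for attr in control runtime_status; do
        file="$root/$addr/power/$attr"
        if [ -r "$file" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$file")"
        else
            printf '%s: unavailable\n' "$attr"
        fi
    done
}
```

Running pci_pm_status 0000:01:00.0 right before modprobe nvidia gives a record of the control setting and runtime status at the moment the module is loaded, which may help correlate the freeze with a pending transition.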
Installing optimus-manager and forcing the use of the nvidia card seems to have sorted the issue for me on Manjaro XFCE
Hello everyone,
Is this the right area for reporting this issue?
Back in September, I was still using the 450.57 driver with the 5.8 Linux kernel and Ubuntu 20.04.
On 5.10 Linux kernel with the 460.39 version, the workaround I mentioned does not work anymore.
Additionally, the "NVreg_DynamicPowerManagement=0x02" option is now the only way to suspend the NVIDIA GPU when not in use. "NVreg_DynamicPowerManagement=0x01" will suspend it only if there are no applications running on it. But on 460.39, Xorg creates something like a persistent glxserver for NVIDIA. That counts as an application, so the driver will never suspend the GPU even if it is not being used.
For the driver development team, I think I have stumbled upon easier reproduction steps that trigger this bug while the system is still running (making data collection possible):
- Reboot with nvidia, nvidia_drm and nvidia_modeset blacklisted. Make sure that these modules are not loaded but can still be loaded manually.
- Make sure that /sys/bus/pci/devices/0000:01:00.0/power/control is on, not auto.
- Make sure that /sys/bus/pci/devices/0000:01:00.0/power/runtime_status reads active.
- Run nvidia-smi or nvidia-bug-report.sh, which should eventually load the nvidia kernel module.
- You get a "Killed" message for nvidia-smi, or nothing for nvidia-bug-report.sh. Nevertheless, the bug should have been triggered: the GPU will not be usable, and the system is on the brink of crashing.
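The precondition checks in the steps above could be scripted roughly as follows. This is only a sketch: the function name and its optional parameters are hypothetical, and it assumes loaded modules are listed by name at the start of each line in /proc/modules.

```shell
# Hypothetical helper: verify the preconditions for the reproduction.
# $1: PCI address of the GPU, e.g. 0000:01:00.0
# $2: sysfs device root (assumed default: /sys/bus/pci/devices)
# $3: loaded-modules file (assumed default: /proc/modules)
# Returns 0 if all preconditions hold, 1 otherwise.
check_repro_preconditions() {
    addr="$1"
    root="${2:-/sys/bus/pci/devices}"
    modules_file="${3:-/proc/modules}"
    failed=0
    # None of the nvidia modules may be loaded yet.
    for mod in nvidia nvidia_drm nvidia_modeset; do
        if grep -q "^$mod " "$modules_file" 2>/dev/null; then
            echo "module $mod is already loaded"
            failed=1
        fi
    done
    # power/control must be "on" and runtime_status must be "active".
    control=$(cat "$root/$addr/power/control" 2>/dev/null)
    status=$(cat "$root/$addr/power/runtime_status" 2>/dev/null)
    if [ "$control" != "on" ]; then
        echo "power/control is '$control', expected 'on'"
        failed=1
    fi
    if [ "$status" != "active" ]; then
        echo "power/runtime_status is '$status', expected 'active'"
        failed=1
    fi
    return "$failed"
}
```

If check_repro_preconditions 0000:01:00.0 succeeds, the next step from the list (running nvidia-smi or nvidia-bug-report.sh) should trigger the module load. Be aware that this step is expected to hang or crash the system.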
I attached the output and the dmesg logs for two cases:
- Just running nvidia-bug-report.sh (I also captured the dmesg log after it): nvidia-bug-report.tar.gz (694.3 KB)
- Running nvidia-smi, then nvidia-bug-report.sh. The nvidia-bug-report.sh hangs: nvidia-smi.tar.gz (742.9 KB)
If more information is needed, please reply and I will try my best to provide it.
I am having the same issue on my workstation with 2x V100 GPUs.
I can use NVIDIA driver 450.119.03 with no issues, but I get the following error during startup when I upgrade to NVIDIA driver 460.73.01. Startup then stops at the following message and the graphical login is not shown.
`[10.404497] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)`
I got this error originally with Ubuntu 18.04 (with kernels 5.4.0-54.60 and 5.4.0-73.82). To try to solve this issue, I upgraded to Ubuntu 20.04 (kernel 5.8.0-53.60) and also upgraded my workstation BIOS, but it did not help. I have also tried both with and without the nouveau driver blacklisted. Following @bugmenot.oss's instructions, I ran
cat /sys/bus/pci/devices/0000:01:00.0/power/control
(my GPU devices seem to be at 0000:15:00 and 0000:2d:00) and observed that the config was auto. I could not edit the config, though, because I had no write permission to /sys, despite mounting it as read-only.
Had to return to driver 450 to get my Ubuntu graphics back. But I do need the 460 driver for development with CUDA 11.2.
Here are the nvidia-bug-report with Nvidia driver 450 and with 460.
nvidia-bug-report_with_nvidia-driver-450.log.gz (727.7 KB)
nvidia-bug-report_with_nvidia-driver-460.log.gz (562.7 KB)
Having the same issue with a GTX 1650 on Manjaro linux 21.2.0 running kernel 5.14 and NVIDIA 495.44 (hybrid graphics)
It seems to happen more frequently when rebooting from Windows, even with Fast Boot toggled off.
Here's nvidia-bug-report on a correct boot (I cannot run it when the system crashes because it's a full-blown system hang)
nvidia-bug-report.log.gz (263.2 KB)
Surprisingly, I didn't have this issue on Ubuntu, although I have deleted the install and don't remember which drivers were running there.
I have also noticed that disabling the NVIDIA card in the BIOS will stop the bug from happening.
I also encountered this problem.
I installed the nvidia driver for my optimus laptop and then my system wouldn't boot consistently. 1 out of 5 times it would boot normally, 1/5 it would boot with the NVIDIA daemon failing, and 3/5 it would not boot and would freeze before login.
This is the "race" reported above. Also, when it booted with the NVIDIA daemon failing, dmesg showed the error in the title.
All I had to do was to blacklist the nvidia driver in /etc/modprobe.d/bumblebee.conf. This way the module is loaded after boot, when bumblebee is called.
This is explicitly mentioned in the configuration file which is created during bumblebee installation:
From /etc/modprobe.d/bumblebee.conf:
# do not automatically load nvidia as it's unloaded anyway when bumblebeed
# starts and may fail bumblebeed to disable the card in a race condition.
Hope this helps someone.
Running On:
Asus N551JW Laptop (Optimus)
Linux 5.10.0-13-amd64
Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M]
Driver Version: 460.91.03