Incorrect power management with PRIME configuration

eric.esteban28 · October 25, 2020, 9:29pm

I am currently forced to stay on the 440 driver because the newer 450 and 455 versions do not handle power management correctly in a PRIME offloading configuration.

With the new drivers the GPU draws more watts when idle, so the computer runs hotter and the fans are always running (with version 440 they stay off unless something is using the NVIDIA GPU). This cuts battery life considerably.

I don’t know whether this is a configuration problem (the configuration should be correct, since it works well with 440) or a driver problem. According to nvidia-smi, with the device doing nothing, the GPU is in performance state P8 with 440 and P0 with 450 or 455.

Here is the nvidia-smi output for both drivers.

nvidia-smi 440:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8     1W /  N/A |     16MiB /  5934MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       888      G   /usr/lib/Xorg                                 14MiB |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:01:00.0

nvidia-smi 455:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28       Driver Version: 455.28       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P0    12W /  N/A |      5MiB /  5934MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       910      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

My machine is an MSI GS65 Stealth 8SE (Intel i7-8750H + RTX 2060) running Manjaro KDE, kernel 5.8 and Xorg 1.20.9. The nvidia-bug-report logs for both drivers are attached.

Thank you very much, and excuse my bad English.

440-nvidia-bug-report.log.gz (239.7 KB) 455-nvidia-bug-report.log.gz (374.0 KB)

eric.esteban28 · November 5, 2020, 8:51pm

Same problem with version 455.36. In this case the GPU is in P8 but it still draws 10-12 W.

Any help?

The problems started when offload sink support was added in 450.57, so maybe that is the cause. I am going back to 440.100.

henriker · November 9, 2020, 10:06am

Same issue here on a Lenovo X1E. The card is stuck at 139 MHz with the memory at its full 3.5 GHz frequency, according to nvtop. Performance in games is extremely poor because of this. It started with 455.36; I am currently on 455.38 but the problems are similar.

eric.esteban28 · November 10, 2020, 8:24pm

In my case performance is good, but I have no power management.

eric.esteban28 · January 3, 2021, 11:38pm

The problem persists with the 460.27 beta. It no longer seems to draw that many watts at idle, but the graphics card does not turn off completely.

nvidia-smi:

Mon Jan 4 00:28:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8     2W /  N/A |      5MiB /  5934MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       629      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

The graphics card is in the correct fine-grained mode, but the video memory never leaves the active state. powertop also shows the graphics card at 100% usage, whereas with 440 it is 0%.

[…@… ~]$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status: Enabled (fine-grained)
Video Memory: Active

GPU Hardware Support:
Video Memory Self Refresh: Supported
Video Memory Off: Supported

I don’t understand why nobody has answered after so many months in which I have been unable to update the driver. Does this happen only on my laptop? What could have changed after 440 so that I no longer have power management? Aren’t new driver versions supposed to fix bugs and make everything work better?

generix · January 4, 2021, 2:58pm

First of all, please don’t use nvidia-smi for runtime pm debugging. It wakes up the gpu and then displays some momentary power usage which is rather useless.
Only use powertop while AC is disconnected to get the power readings.
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
should then display the state suspended.
In the bug report logs for the 455 driver, all states were correctly showing.
gpu state suspended
Video Memory off
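
The sysfs check described above can be wrapped in a small helper so that the state is read without calling nvidia-smi. This is only a minimal sketch: the function name gpu_runtime_status and the GPU_SYSFS_DIR override are hypothetical, and the default PCI address 0000:01:00.0 is simply the one used in this thread.

```shell
#!/bin/sh
# Sketch: print the kernel's runtime-PM state for a PCI device by reading
# sysfs only. The default device path is an assumption taken from this thread.

# gpu_runtime_status DEVICE_DIR
# Prints the content of DEVICE_DIR/power/runtime_status (for example
# "active" or "suspended"), or "unknown" if the file cannot be read.
gpu_runtime_status() {
    status_file="$1/power/runtime_status"
    if [ -r "$status_file" ]; then
        cat "$status_file"
    else
        echo "unknown"
    fi
}

# GPU_SYSFS_DIR is a hypothetical override for machines with another address.
gpu_runtime_status "${GPU_SYSFS_DIR:-/sys/bus/pci/devices/0000:01:00.0}"
```

Running it a few minutes after the last GPU use, on battery, should print suspended if runtime D3 is working; active means something is still holding the device awake.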

eric.esteban28 · January 5, 2021, 10:00am

Thanks for answering.

The command always shows the card as active. I am sure that no application is keeping the card awake (that has happened to me in the past), so if it is in this state there must be another cause.

[…@… ~]$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active

I’m also attaching the nvidia-bug-report log in case it helps.

With 455, whatever the log shows, it certainly was not working properly because the fans never stopped spinning. The nvidia-smi outputs show the temperature difference between 440 and 455 while doing the same thing. With 460 the fans do not spin, but you can feel the heat from the GPU because it does not turn off completely.

460-bug-report.log.gz (250.1 KB)

generix · January 5, 2021, 10:48am

Just noticed you’re running a 5.10 kernel, please see this:
https://forums.developer.nvidia.com/t/pci-express-runtime-d3-power-management-broken-by-commit-4d03e3cc59828/164901/2

eric.esteban28 · January 5, 2021, 3:18pm

Thanks.

I have switched to kernel 5.9 and the card now shows as suspended. powertop also shows the card at 0% usage (it does not report the power draw in watts), so apparently everything is correct.

But the GPU fans keep turning on, as with 455, so there is still a problem.

[…@… ~]$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspended

[…@… ~]$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status: Enabled (fine-grained)
Video Memory: Off

GPU Hardware Support:
Video Memory Self Refresh: Supported
Video Memory Off: Supported

If I run nvidia-smi after several minutes of inactivity, it shows the GPU at close to 50 °C when it should be just over 40 °C.

Tue Jan 5 16:01:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P5     9W /  N/A |      5MiB /  5934MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       715      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

I am attaching the nvidia-bug-report log for this kernel as well.

460-linux59-bug-report.log.gz (272.3 KB)

generix · January 5, 2021, 4:17pm

Don’t know if this works at all, but can you try disabling the output sink feature by adding
Option "AllowPRIMEDisplayOffloadSink" "false"
inside the outputclass section of
/usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf

eric.esteban28 · January 6, 2021, 2:11pm

Nothing changes.

I am starting to think the laptop itself has a problem, since it was one of the first models released with this CPU + GPU combination, and that there is no solution. The sad thing is that everything worked well with 440.

I hope something new is shown at CES so that in a few months I can replace it, because this is a very annoying bug.

generix · January 6, 2021, 2:33pm

You could check for a BIOS update; yours is from 2018.

eric.esteban28 · January 19, 2021, 8:54pm

I’ve been testing for a while with the 455 versions and the new 460 stable release, but it’s still the same.

It’s funny because with kernel 5.10, despite the bug, at least the fans turn off and it’s not as annoying.

I can’t update the BIOS either, because then I get other CPU fan issues, and I don’t know whether it would fix this anyway.

Anyway, thanks for the help. At least we tried.

eric.esteban28 · February 11, 2021, 9:52pm

The problem persists with version 460.39 and a 5.10 kernel in which the runtime power management problem is already fixed.

eric.esteban28 · April 4, 2021, 8:53pm

I have retested with the latest beta driver, 465.19.01, and the same problem remains. I have also tried the options from the messages above, without success.

But I have found that if I keep the card on (by keeping nvidia-settings open) it heats up less and power management is better: the battery lasts only a little less than it did when the card was completely turned off. If I don’t keep it open, the card gets out of control: it heats up more, consumes more, the fans turn on and the battery drains quickly. When I then open nvidia-settings or run nvidia-smi, the card is noticeably hotter than it should be.

Could this confirm a fault in the laptop itself? Does the driver do anything while the card is off? Even so, I would not understand why it worked well with 440.

I have also noticed that some udev rules fail when the system starts. Is that normal?

~ systemd-udevd[323]: 0000:01:00.1: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:8 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/remove}, ignoring: No such file or directory
~ systemd-udevd[321]: 0000:01:00.2: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:2 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.2/remove}, ignoring: No such file or directory
~ systemd-udevd[306]: 0000:01:00.3: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:5 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.3/remove}, ignoring: No such file or directory

A photo of nvidia-settings comparing GPU temperatures is attached, along with a new nvidia-bug-report.

nvidia-bug-report.log.gz (247.6 KB)

eric.esteban28 · April 11, 2021, 8:12pm

Is there any other command or log that could help track down the problem? Maybe some ACPI log?

I have seen a user on Reddit with a similar problem. In his case the GPU switches between active and suspended for no reason (about 20 seconds in every 100) since the 450 drivers, while it worked well with 440. That user says the laptop is a ThinkPad X1 Extreme 2 with a GTX 1650. So these are isolated cases, but it does happen to several users.

I’m just trying to pin down the problem so that at least you know what is happening.

eric.esteban28 · August 3, 2021, 7:30pm

I try every driver update, but none of them fixes it.

After the summer I plan to replace the computer, but the most common configuration now is a Ryzen 5800H + NVIDIA.

Does power management work with that configuration, or will I still have the same problem?
If it does not work, are there plans to implement it soon?

eric.esteban28 · August 31, 2022, 7:18am

A year later I still have the problem. Any new suggestions to try?

I don’t know if anyone will answer me xD

generix · August 31, 2022, 7:58am

Your notebook’s BIOS simply doesn’t support runtime D3 (though it should). Instead, when the kernel suspends the GPU, it has the opposite effect and power draw increases. A well-known issue, unfortunately. To get back to the previous behaviour, you need to disable runtime PM:

sudo tee /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control <<<on

and then maybe change/create the udev rule to do this automatically on boot.
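
One possible way to make this setting persistent is a udev rule that writes on to power/control whenever the GPU appears. The following is a minimal sketch, not a tested fix for this laptop. The file name 91-nvidia-runtime-pm-off.rules is hypothetical, chosen only so that it sorts after the 90-mhwd-prime-powermanagement.rules file seen in the log earlier in this thread. Matching on vendor 0x10de with class 0x030000 or 0x030200 is an assumption about how the GPU enumerates.

```shell
#!/bin/sh
# Sketch: print udev rules that keep runtime PM disabled for NVIDIA GPUs.
# Assumptions: PCI vendor 0x10de (NVIDIA), class 0x030000 (VGA controller)
# or 0x030200 (3D controller). Adjust if the device enumerates differently.

print_pm_off_rules() {
    cat <<'EOF'
# Force power/control to "on" (no runtime suspend) for NVIDIA VGA/3D controllers
ACTION=="add|bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="add|bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
EOF
}

# Print the rules to stdout; redirect the output to a file to install them.
print_pm_off_rules
```

Saving the output as root to /etc/udev/rules.d/91-nvidia-runtime-pm-off.rules and rebooting should apply it. Afterwards, reading the power/control file from the command above should print on. If another rule file later in the ordering sets the attribute back to auto, that rule would need to be edited instead.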

generix · August 31, 2022, 8:49am

In that regard, you should really update your BIOS and check whether that improves things.
https://www.msi.com/Laptop/GS65-Stealth-8SE/support#bios
