Incorrect power management with PRIME configuration

Same issue here on a Lenovo X1E. The card is stuck at 139 MHz with the memory at its full 3.5 GHz frequency, according to nvtop. Performance in games is extremely poor because of this. It started with 455.36; I am currently on 455.38 and the problems are similar.

In my case, performance is good but I don’t have power management.

The problem persists with the 460.27 beta. It now seems to draw fewer watts at idle, but the graphics card still does not turn off completely.

nvidia-smi:

Mon Jan 4 00:28:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8     2W /  N/A |      5MiB /  5934MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       629      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

The graphics card is in the correct fine-grained mode, but the video memory never changes to an inactive state. powertop also shows the graphics card at 100% usage, whereas with the 440 driver it shows 0%.

[…@… ~]$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status: Enabled (fine-grained)
Video Memory: Active

GPU Hardware Support:
Video Memory Self Refresh: Supported
Video Memory Off: Supported

I don’t understand how nobody has given me an answer after so many months of not being able to update drivers. Does it only happen with my laptop? What could have changed so that I have had no power management since 440? Aren’t new driver versions supposed to fix bugs and make everything work better?

First of all, please don’t use nvidia-smi for runtime pm debugging. It wakes up the gpu and then displays some momentary power usage which is rather useless.
Only use powertop while AC is disconnected to get the power readings.
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
should then display the state suspended.
In the bug report logs for the 455 driver, all states were correctly showing.
gpu state suspended
Video Memory off
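
The checks described above can be wrapped in a small shell helper so they are easy to repeat without calling nvidia-smi. This is only a sketch: the function name gpu_pm_state is made up for illustration, and the default paths use the PCI address 0000:01:00.0 from this thread, which may differ on other machines.

```shell
#!/bin/sh
# Hypothetical helper: print the runtime PM state of a GPU by reading
# sysfs/procfs files only (nvidia-smi is never invoked).
# $1 = PCI device directory in sysfs (assumed default: address from this thread)
# $2 = NVIDIA driver power file in procfs (assumed default: same address)
gpu_pm_state() {
    sysfs_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    proc_file="${2:-/proc/driver/nvidia/gpus/0000:01:00.0/power}"

    # Kernel view: "active", "suspended", ...
    if [ -r "$sysfs_dir/power/runtime_status" ]; then
        printf 'runtime_status: %s\n' "$(cat "$sysfs_dir/power/runtime_status")"
    else
        printf 'runtime_status: unavailable\n'
    fi

    # Driver view: keep only the "Runtime D3 status" and "Video Memory" lines
    if [ -r "$proc_file" ]; then
        grep -E '^(Runtime D3 status|Video Memory):' "$proc_file" || true
    fi
}
```

Run gpu_pm_state on battery after the GPU has been idle for a while; the expected healthy result is runtime_status: suspended together with Video Memory: Off.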

Thanks for answering.

The command always shows that the card is active. I am sure no application is keeping the card awake, because that has happened to me before; if it is in this state, something else is causing it.

[…@… ~]$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active

I’m also attaching the nvidia_bug log in case it can help.

With the 455 driver, although the log shows that, it certainly did not work well because the fans never stopped spinning. In the nvidia-smi output you can see the temperature difference between 440 and 455 when they were doing the same thing. With 460 the fans do not spin, but you can feel the heat from the GPU because it does not turn off completely.

460-bug-report.log.gz (250.1 KB)

Just noticed you’re running a 5.10 kernel, please see this:
https://forums.developer.nvidia.com/t/pci-express-runtime-d3-power-management-broken-by-commit-4d03e3cc59828/164901/2

Thanks.

I have switched to kernel 5.9 and it now shows that the card is suspended. powertop also shows the card at 0% usage (it does not show the power draw in watts), so apparently everything is correct.

But the GPU fans keep turning on as with the 455 so there is still a problem.

[…@… ~]$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspended

[…@… ~]$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status: Enabled (fine-grained)
Video Memory: Off

GPU Hardware Support:
Video Memory Self Refresh: Supported
Video Memory Off: Supported

If I launch nvidia-smi after several minutes of non-use, it shows the GPU close to 50 °C when it should be just over 40 °C.

Tue Jan 5 16:01:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P5     9W /  N/A |      5MiB /  5934MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       715      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

I am attaching nvidia_bug_report with this kernel again.

460-linux59-bug-report.log.gz (272.3 KB)

Don’t know if this works at all, but can you try disabling the output sink feature by adding
Option "AllowPRIMEDisplayOffloadSink" "false"
inside the outputclass section of
/usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf

Nothing changes.

I am starting to think the laptop itself has a problem, because it was one of the first to come out with this CPU + GPU combination, and that there is no solution. The sad thing is that with the 440 everything worked well.

I hope they show something new at CES and in a few months I can replace it, because it’s a very annoying bug.

You could check for a BIOS update; yours is from 2018.

I’ve been testing for a while with the 455 versions and the new 460 stable but it’s still the same.

It’s funny because with kernel 5.10 despite the bug at least it turns off the fans and it’s not that annoying.

I can’t update the BIOS either because then I have other CPU fan issues and I don’t know if it would be fixed either.

Anyway, thanks for the help; at least we tried.

The problem persists with version 460.39 and kernel 5.10, in which the power management problem is already fixed.

I have retested with the latest beta drivers, 465.19.01, and the same problem continues. I have also tried the parameters from the messages above, without success.

But I have found that if I keep the card on (by keeping nvidia-settings open) it heats up less and I get better power management; the battery lasts only a little less than when the card was completely turned off. If I don’t open it, the card gets out of control: it heats up more, consumes more, the fans turn on and the battery drains quickly. When I then open nvidia-settings or launch nvidia-smi, you can see that the card is considerably hotter than it should be.

Could this confirm a fault in the laptop itself? Does the driver do anything while the card is off? Even so, I would not understand how it worked well with the 440.

I have also noticed that some udev rules fail when the system starts. Is that normal?

~ systemd-udevd[323]: 0000:01:00.1: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:8 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/remove}, ignoring: No such file or directory
~ systemd-udevd[321]: 0000:01:00.2: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:2 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.2/remove}, ignoring: No such file or directory
~ systemd-udevd[306]: 0000:01:00.3: /etc/udev/rules.d/90-mhwd-prime-powermanagement.rules:5 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.3/remove}, ignoring: No such file or directory

Attached photo of nvidia-settings comparing GPU temperature. And again the nvidia-bug-report.

nvidia-bug-report.log.gz (247.6 KB)

Is there any other command or log that can help detect the problem? Maybe some ACPI log?

I have seen a user on reddit with a similar problem. In his case the GPU switches between active and suspended for no reason (active for about 20 seconds in every 100 seconds) since the 450 drivers; with 440 it worked well. That user says the laptop is a Thinkpad X1 Extreme 2 with a GTX 1650. So these are isolated cases, but it does happen to several users.
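
A pattern like "active for roughly 20 seconds out of every 100" can be measured rather than eyeballed. The kernel exposes cumulative runtime_active_time and runtime_suspended_time counters (in milliseconds) next to runtime_status, so sampling them twice gives the share of a window spent suspended. The sketch below assumes those two files are present for the device; the function names and the default 100-second window are made up for illustration.

```shell
#!/bin/sh
# Percentage of time spent runtime-suspended, given two millisecond
# counters (integer arithmetic, rounded down).
suspended_percent() {
    active_ms="$1"
    suspended_ms="$2"
    total=$((active_ms + suspended_ms))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((suspended_ms * 100 / total))
    fi
}

# Hypothetical helper: sample the cumulative counters, wait, sample again,
# and print the share of the window the device spent suspended.
# $1 = PCI device directory (assumed default: address from this thread)
# $2 = window length in seconds (assumed default: 100)
gpu_suspend_share() {
    dev="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    window="${2:-100}"
    a0=$(cat "$dev/power/runtime_active_time")
    s0=$(cat "$dev/power/runtime_suspended_time")
    sleep "$window"
    a1=$(cat "$dev/power/runtime_active_time")
    s1=$(cat "$dev/power/runtime_suspended_time")
    suspended_percent $((a1 - a0)) $((s1 - s0))
}
```

On an idle system a result close to 100 would mean the GPU stayed suspended; a value around 80 would match the 20-in-100-seconds wakeups described above.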

I’m just trying to detect the problem so at least you can know what happens.

I try every driver update but it is never fixed.

After the summer I plan to replace the computer, but now the most common configuration is Ryzen 5800H + Nvidia.

Does power management work with that configuration, or will I still have the same problem?
If it does not work, is there a plan to implement it soon?

A year later I’m still with the problem. Any new suggestions to try?

I don’t know if anyone will answer me xD

Your notebook’s BIOS simply doesn’t support runtime D3 (while it should). Instead, when the kernel suspends the GPU, it has the opposite effect: increased power draw. This is a well-known issue, unfortunately. To get back to the previous behaviour, you need to disable runtime PM:

sudo tee /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control <<<on

and then maybe change/create the udev rule to do this automatically on boot.
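
One possible shape for such a udev rule is sketched below. It is modeled on the runtime D3 rules in NVIDIA's driver README, with the value "auto" swapped for "on"; the function name and the file name 99-nvidia-no-runtime-pm.rules are made up for illustration. If a distribution rule file such as the 90-mhwd-prime-powermanagement.rules mentioned earlier in this thread already sets power/control, that file may need to be edited or overridden instead.

```shell
#!/bin/sh
# Hypothetical helper: write a udev rule file that forces power/control to
# "on" (runtime PM disabled) for NVIDIA VGA/3D controllers.
# $1 = path of the rule file to create
write_no_runtime_pm_rule() {
    out="$1"
    cat > "$out" <<'EOF'
# Keep runtime PM disabled for NVIDIA VGA/3D controllers (vendor 0x10de)
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
EOF
}

# Example (file name is an assumption, not something from this thread):
#   write_no_runtime_pm_rule /tmp/99-nvidia-no-runtime-pm.rules
#   sudo install -m 644 /tmp/99-nvidia-no-runtime-pm.rules /etc/udev/rules.d/
#   sudo udevadm control --reload
```

After a reboot, cat /sys/bus/pci/devices/0000:01:00.0/power/control should print on if the rule was applied.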

In that regard, you should really update your BIOS and check if that improves things.
https://www.msi.com/Laptop/GS65-Stealth-8SE/support#bios

The BIOS update does not fix the problem; I have updated and nothing changed. The “cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status” command keeps saying that the GPU is suspended, but more power is consumed, it gets hotter and the fans turn on.

It is not that it has never worked for me: with driver version 440 or earlier it worked well. The problem is with the later driver versions.

From what you say, I understand that it is an MSI problem and they should be the ones to solve it.

I suspect with driver 440, runtime pm wasn’t used.