Xorg still in GPU with PRIME Offload and dynamic power management

My issue is rather straightforward: I have set up PRIME Render Offload and Runtime D3 power management as specified in the driver manual. However, I have the following:

$ nvidia-smi
Wed Mar 10 03:09:09 2021  
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8     8W /  N/A |      5MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

Now, the claim is that nvidia-smi awakens the GPU to poll it. Alright, so I perform:

$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_suspended_time
0

The GPU is hence active, and has never suspended since boot.
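For anyone wanting to repeat the check above in one go, here is a minimal sketch that prints the kernel's runtime-PM attributes for the device. The function name `gpu_pm_report` is made up for illustration, and the default path assumes the GPU sits at 0000:01:00.0 as in this thread. As far as I know, reading these sysfs files does not wake the device, unlike nvidia-smi.

```shell
#!/bin/sh
# Print the kernel's runtime-PM view of a PCI device from sysfs.
# dev_dir defaults to the GPU address used in this thread (an assumption
# for any other machine); pass a different directory as the first argument.
gpu_pm_report() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    for attr in control runtime_status runtime_suspended_time runtime_active_time; do
        file="$dev_dir/power/$attr"
        if [ -r "$file" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$file")"
        else
            printf '%s: (unavailable)\n' "$attr"
        fi
    done
}
```

Calling `gpu_pm_report` with no arguments reads the default device; the two time values are reported by the kernel in milliseconds.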

Finally:

$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status:          Enabled (fine-grained)
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Supported
 Video Memory Off:          Supported

What can I do? The worst thing is that my GPU’s P8 idle state is a whopping 9 watts: this is greater than the total system power consumption of many ultrabooks which have nailed Linux power management.

Many others with similar problems noticed that they have some Xorg program or another hooking into nvidia-smi or nvidia-settings: I have uninstalled the latter, and I have also tried this with sddm.service disabled and after a reboot (hence Xorg is never loaded): the power state is still stuck at P8, the video memory is still active, and the GPU is never suspended properly.

I am at my wits’ end here: hoping for a solution to this.

What does
cat /sys/bus/pci/devices/0000:01:00.0/power/control
output? It has to be set to ‘auto’
https://download.nvidia.com/XFree86/Linux-x86_64/460.56/README/dynamicpowermanagement.html

Hmm, it was set to on. However, I also have the udev rule set up exactly as that page describes, so I am not sure why.

Issuing echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control worked, but for some reason, plugging in/unplugging the notebook sends it back to on.
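To pin down when the value flips back, one option is to poll the file and log each change while plugging and unplugging the charger. This is only a rough sketch: `watch_pm_control` is a hypothetical helper name, and the default device path is the one from this thread.

```shell
#!/bin/sh
# Poll power/control and print a timestamped line whenever its value changes.
# Arguments: device dir, poll interval in seconds, number of polls.
# All three defaults are illustrative assumptions.
watch_pm_control() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    interval="${2:-1}"
    polls="${3:-60}"
    last=""
    i=0
    while [ "$i" -lt "$polls" ]; do
        now=$(cat "$dev_dir/power/control" 2>/dev/null || echo unreadable)
        if [ "$now" != "$last" ]; then
            printf '%s control=%s\n' "$(date +%T)" "$now"
            last="$now"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running `watch_pm_control` in one terminal and toggling AC power should show at what moment the setting reverts, which can then be matched against the system journal.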

Since I have no idea which distro you’re using: there might be conflicting udev rules being set up, a PRIME manager running, or some acpid/systemd-acpid config changing this.
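One quick way to look for conflicting rules is to list every udev rule file that touches power/control. The sketch below assumes the usual rule directories; `find_pm_rules` is a name invented for this example.

```shell
#!/bin/sh
# List udev rule files that mention power/control in the given directories.
# With no arguments, search the common rule locations (an assumption;
# distributions may use other paths as well).
find_pm_rules() {
    [ "$#" -gt 0 ] || set -- /etc/udev/rules.d /usr/lib/udev/rules.d /run/udev/rules.d
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # grep -l prints only the names of matching files
        grep -l 'power/control' "$dir"/*.rules 2>/dev/null
    done
    return 0
}
```

If more than one file shows up, it is worth reading each to see which one writes on and under what condition.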

I am using Arch Linux; no other Optimus managers, and no custom Xorg conf files besides what is mentioned in the manual.

In a nutshell, I have followed the D3 power management and PRIME Render Offload pages of the manual to the letter, and set up a file in /etc/Xorg.conf.d/10-gpu.conf containing:

Section "ServerLayout"
    Identifier "layout"
    Option "AllowNVIDIAGPUScreens"
EndSection

Maybe one issue at a time. On boot, the udev rule triggers once the driver is loaded. Did you add the driver and the udev rule to the initrd?

I have the exact same problem on a 1650 Ti Max-Q with an i7-9750H notebook. One difference: I have a number higher than 0 in /sys/bus/pci/devices/0000:01:00.0/power/runtime_suspended_time, though it never increments while the X server is running. I found out that the longer I take to log in to the desktop session, the higher the number is. So it seems the suspend functionality is working, but stops the moment the X server is started.
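The "does the counter still increment" test can be scripted by sampling the file twice. A minimal sketch, assuming the same device address as above; `suspended_time_delta` is a hypothetical name.

```shell
#!/bin/sh
# Sample runtime_suspended_time twice and print the difference.
# A result of 0 means the device spent no time suspended in between.
# Arguments: device dir, seconds to wait between samples (defaults are
# illustrative assumptions).
suspended_time_delta() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    wait_s="${2:-10}"
    file="$dev_dir/power/runtime_suspended_time"
    before=$(cat "$file") || return 1
    sleep "$wait_s"
    after=$(cat "$file") || return 1
    echo $((after - before))
}
```

Running it once at the login screen and once inside the desktop session would show whether the counter really stops when the X server starts.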

In Arch Linux, the initrd is managed by mkinitcpio, and I have added the nvidia* modules as described here. I have also followed the udev rule setup of the manual, again in this page.

However, the driver (specifically, cat /sys/bus/pci/devices/0000:01:00.0/power/control) still reports on rather than auto after a reboot, despite the udev rule. I have also added said udev rule to mkinitcpio.conf and regenerated the initramfs, to no avail. I have to manually write auto to the file above for power management to work, and even then, the GPU is stuck at P8, draws 9 W and has Xorg in the list of processes occupying the GPU.

You can’t use nvidia-smi to check runtime pm or even idle power consumption since it wakes up the gpu.

I suspected too that nvidia-smi would break the suspend so I did some testing.

First of all though, I am using the AUR package optimus-manager in combination with optimus-manager-qt. That could of course be the reason it is not working for me, but so far I haven’t found any setting that optimus-manager applies differently from what the NVIDIA documentation suggests. It does set the control variable to auto successfully for me every time, so maybe the package is worth a try for you too, @SRSR333. Back up your Xorg config file first though: optimus-manager creates a new one, which from then on can only be edited under /etc/optimus-manager/xorg. It also puts blacklist entries in lib/modprobe.d as well as /usr/lib/modprobe.d. It blacklists basically everything related to nouveau or nvidia.

Since optimus-manager has an integrated, a hybrid and an nvidia mode, I checked the power consumption in all three of them. I used powertop for that, so I could stay away from the dGPU.

On integrated, the dGPU runs at power state P0 @ 10 W and never changes (no power management at all, since the nvidia drivers are unloaded).
On hybrid, it runs at P8 and 3 W and never changes (as long as I don’t tax it).
Idling on nvidia, it runs at P8 and 3 W, the same as hybrid. Yet every program I start taxes it, since it runs the desktop session then.

So it’s definitely not suspending for me in hybrid mode.

I don’t know though whether it is a hardware thing and the dGPU simply can’t be suspended. I got a script that issues several ACPI calls to suspend a dGPU, testing whether each call works. One worked, but it froze my system. Killing the Xorg process running on the dGPU in hybrid mode takes me back to the desktop login manager, so effectively it kills the X server.

So I am out of ideas at the moment. :/

Found a Reddit comment from someone having the same issue on his laptop: he had nouveau running as the driver in use on the dGPU. Blacklisting it solved the issue for him. It did not for me though; maybe worth a try for you, @SRSR333.

P.S.: Here is a picture of my status monitoring of the nvidia driver in hybrid mode in conky. Of course I know the nvidia-smi calls in the conky script would break the suspend, so I wrapped them in an if-statement that checks the suspend state first, and does so more frequently than the smi calls run. I also tested for quite some time without conky running. The temperature normally idles at 48°C; in this picture the CPU heated the GPU up passively: https://i.imgur.com/dPkewxL.jpg
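The guard described above might look roughly like this. It is a sketch rather than the actual conky script: `run_if_gpu_active` is an invented name, and the idea is simply to read runtime_status from sysfs and only call the query command when the GPU is already awake.

```shell
#!/bin/sh
# Run a command only if the GPU is already awake, so that monitoring
# itself never wakes a suspended device.
# Arguments: device dir, then the command and its arguments.
run_if_gpu_active() {
    dev_dir="$1"
    shift
    status=$(cat "$dev_dir/power/runtime_status" 2>/dev/null)
    if [ "$status" = "active" ]; then
        "$@"
    else
        echo "GPU ${status:-unknown}, skipping query"
    fi
}
```

For example, `run_if_gpu_active /sys/bus/pci/devices/0000:01:00.0 nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader` would print the temperature only while the GPU is active and a short notice otherwise.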

Since runtime PM relies on ACPI, you should first check for a BIOS update in case anything is wonky. powertop is the way to go; it outputs reliable values (if the battery is good).
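As a cross-check on powertop, the battery's own discharge rate can be read from sysfs while on battery, again without touching the dGPU. This sketch assumes the battery is exposed as BAT0 and reports either power_now (microwatts) or current_now and voltage_now (microamps and microvolts); which attributes exist depends on the hardware, and `battery_watts` is a made-up name.

```shell
#!/bin/sh
# Report the battery discharge rate in watts from sysfs.
# Assumes power_now is in microwatts; falls back to current_now (microamps)
# times voltage_now (microvolts) when power_now is not exposed.
battery_watts() {
    bat_dir="${1:-/sys/class/power_supply/BAT0}"
    if [ -r "$bat_dir/power_now" ]; then
        awk '{ printf "%.2f\n", $1 / 1000000 }' "$bat_dir/power_now"
    elif [ -r "$bat_dir/current_now" ] && [ -r "$bat_dir/voltage_now" ]; then
        awk -v v="$(cat "$bat_dir/voltage_now")" \
            '{ printf "%.2f\n", $1 * v / 1000000000000 }' "$bat_dir/current_now"
    else
        echo "no power reading in $bat_dir" >&2
        return 1
    fi
}
```

Comparing the reading with control set to on and then to auto, with everything else left alone, gives a rough idea of how much the dGPU contributes.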

That’s an idea. Maybe I could also try spoofing different Windows versions at kernel startup. I have seen in the ACPI DSDT that this laptop was set up for a long list of Windows versions.

If none of that helps, it could be a hardware thing. I remember from attempts to set up a virtual machine with GPU offloading on another laptop that there are different ways to wire the iGPU and dGPU. The best thing to have would be a hardware mux, but only a few laptops have that.

On Windows this laptop has around 7 hours of battery life on light office work. That would be around 10 W of average draw, which would be the same as on Linux. So maybe it is idling at 3 W on Windows too.