On Ryzen Mobile, Turing GPU dynamic power management seems to be broken (on driver 435.21)

I have an ASUS TUF FX505DV, which comes with a Ryzen 7 3750H CPU and an RTX 2060 GPU.

The PRIME render offloading feature itself seems to work fine, but dynamic power management does not.

I’ve done all the steps from the “automated” section of https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/dynamicpowermanagement.html , although I still have to enable automatic power management manually, as described at https://devtalk.nvidia.com/default/topic/957981/linux/prime-render-offloading-on-nvidia-optimus/post/5373502 (see post #63). The kernel module parameter is also set to 0x02, and judging by /proc/driver/nvidia/params, the driver recognizes it.
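A minimal sketch of checking which value the loaded driver actually picked up. It assumes the parameter in question is NVreg_DynamicPowerManagement from the linked README, and that /proc/driver/nvidia/params lists entries as "Name: value" without the NVreg_ prefix. The helper name nvidia_param is made up for illustration.

```shell
# Print the value of one NVIDIA module parameter as the loaded driver reports it.
# $1: parameter name without the NVreg_ prefix (assumed file format: "Name: value")
# $2: params file, defaults to /proc/driver/nvidia/params
nvidia_param() {
    awk -F': *' -v name="$1" '$1 == name { print $2 }' \
        "${2:-/proc/driver/nvidia/params}"
}
```

Calling `nvidia_param DynamicPowerManagement` should then print the value in effect, which would be 2 for a setting of 0x02 if the file reports it in decimal.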

However, the GPU appears to be always on. nvidia-smi reports this:

$ nvidia-smi
Fri Oct 11 13:35:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8     4W /  N/A |     16MiB /  5934MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1201      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

and the runtime power management status in sysfs shows this:

$ cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_suspended_time 
active
0

Is this because the CPU (and the chipset) do not support the required ACPI power management features? Is this a misconfiguration on my part, or is it a driver problem?

This is the only thing that prevents me from using Linux 100% of the time on this machine, so any help will be much appreciated.
nvidia-bug-report.log.gz (664 KB)

Just as a note, you can’t really use nvidia-smi to check runtime PM, since running it wakes up the GPU.
The setup seems to be correctly done, so when
cat /sys/bus/pci/devices/0000:01:00.0/power/control
returns “auto” and
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
still returns “active”, then runtime power management doesn’t seem to work on an AMD platform right now. That doesn’t necessarily mean the platform can’t support it; I’d rather guess that the driver doesn’t expect this combination yet.
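The two sysfs reads above can be combined into one check that never touches the GPU itself. This is only a sketch: the function name gpu_runtime_pm_state is made up, and the default path is the 0000:01:00.0 address from this thread, which may differ on other machines.

```shell
# Print the runtime PM policy and state of a PCI device straight from sysfs.
# Unlike nvidia-smi, reading these files does not wake the GPU.
# $1: sysfs device directory (defaults to the GPU address used in this thread)
gpu_runtime_pm_state() {
    dev="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    control=$(cat "$dev/power/control")
    status=$(cat "$dev/power/runtime_status")
    echo "control=$control status=$status"
    # Success only when the policy is "auto" and the device is suspended;
    # "auto" with "active" on an idle GPU is the symptom reported here.
    if [ "$control" = "auto" ] && [ "$status" = "suspended" ]; then
        return 0
    fi
    return 1
}
```

Run `gpu_runtime_pm_state` with no arguments on the affected machine; a non-zero exit status while the GPU is idle matches the problem described in the first post.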

I think that’s the case. That’s pretty sad, but maybe it’s an easy fix?

In dmesg, it says that

[    0.736369] pci 0000:01:00.0: PME# supported from D0 D3hot

So I guess both the platform and the GPU support this, but the driver just doesn’t try to make use of it.

That message is irrelevant. Maybe you can get some more info from the driver by setting the module parameter
NVreg_ResmanDebugLevel=0

Just did that, the bug report is attached.

Another observation: when I changed the udev config so that auto power management is enabled as soon as the device is added, the GPU actually did switch to a suspended state for some time.

$ cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_suspended_time 
active
4372

As you can see, the suspended time is now greater than 0, but the GPU does not suspend at any other time.
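One way to confirm that the counter stops growing is to sample it twice and compare. This is a rough sketch: the function name suspended_delta_ms and the 10-second default interval are arbitrary choices, and it assumes runtime_suspended_time is reported in milliseconds.

```shell
# Sample runtime_suspended_time twice and print the difference.
# A result of 0 means the device never suspended during the interval.
# $1: sysfs device directory (defaults to the GPU address used in this thread)
# $2: seconds to wait between the two samples (default 10)
suspended_delta_ms() {
    dev="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    interval="${2:-10}"
    before=$(cat "$dev/power/runtime_suspended_time")
    sleep "$interval"
    after=$(cat "$dev/power/runtime_suspended_time")
    echo $((after - before))
}
```

Running it once before X starts and once afterwards would show whether the GPU only suspends in the window before the X driver loads.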

nvidia-bug-report.log.gz (716 KB)

I guess that it is suspending only until the X driver loads. Then it is kept active.
With the debug level set to info, there’s a log flood right now. It would be interesting to see any messages about the PR3 method, probably logged just at driver load time. Is it possible to disable X and, right after boot, run
sudo dmesg |grep PR3
to check if there are some messages about it?

There are no messages about PR3 in dmesg, either with or without X.

I also grepped the bug report and there isn’t anything about PR3 either (there’s something about _DOD), but I guess it’s not relevant.