300MHz GPU frequency limit for RTX A4000 on Linux

Hi there!
I have a problem with my RTX A4000 on Linux setup:

OS: Linux msi-pc 5.17.3-gentoo-x86_64 #1 SMP PREEMPT Thu Apr 14 03:37:10 EEST 2022 x86_64 AMD Ryzen Threadripper 2950X 16-Core Processor AuthenticAMD GNU/Linux
Nvidia drivers: x11-drivers/nvidia-drivers-510.60.02:0/510::gentoo

The problem is that the GPU frequency is far below spec (70-200 MHz) when running long, heavy PyTorch ML (CUDA) tasks, even though the GPU is utilized at 100% for hours. When I add a frequency offset of +1000 MHz (lower values have no effect), the clock sticks at 300 MHz and never goes up (or even down).

Meanwhile, the same PC reaches 1500+ MHz on Windows 11 with the latest drivers. Please help.

UPD: The same happens with driver 470.103.01.
nvidia-bug-report.log.gz (354.9 KB)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Done

HW Power Brake Slowdown : Active

The mainboard is emitting the PWRBRK signal. Interesting that this doesn’t have any effect with the Windows driver.
Since the dmidecode output wasn’t included in the logs, which model/brand is the mainboard?
https://forums.developer.nvidia.com/t/rtx-a5000-stuck-at-400-500mhz-due-to-hw-power-brake-slowdown-on-ubuntu-20-04-3/189868?u=generix

MB is: X399 GAMING PRO CARBON AC

Please check if up-/downgrading the BIOS changes anything. It’s a bit odd that a gaming mainboard emits PWRBRK; usually only server/workstation boards do this if an “uncertified” graphics card is inserted. So I suspect MSI accidentally enabled this.

But why does it work on Windows 11 then? Is it possible to ignore that signal via driver settings, etc.?

It might be that the Nvidia Windows driver ignores this, though it’s a hardware signal. Please check the nvidia-smi -q output on Windows.
There is no way to disable this with the Linux driver; you can only use tape/nail polish to cover the PCIe pin.
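
If it helps when comparing the two OSes, here is a minimal Python sketch that pulls the active throttle reasons out of saved nvidia-smi -q text. The function name active_throttle_reasons and the SAMPLE excerpt are only illustrative (they are not part of any NVIDIA tool), and the sketch assumes the section is labelled "Clocks Throttle Reasons", as in the output of the driver versions discussed here.

```python
import re

# Illustrative excerpt in the layout that `nvidia-smi -q` prints (not live data).
SAMPLE = """\
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Active
        Sync Boost                  : Not Active
    Clocks
        Graphics                    : 390 MHz
"""

# One "<name> : Active" or "<name> : Not Active" entry.
ENTRY = re.compile(r"^(.+?)\s*:\s*(Active|Not Active)$")


def active_throttle_reasons(smi_text):
    """Return the names of throttle reasons reported as 'Active'."""
    reasons = []
    in_section = False
    for line in smi_text.splitlines():
        stripped = line.strip()
        if stripped == "Clocks Throttle Reasons":
            in_section = True
            continue
        if not in_section or not stripped:
            continue
        match = ENTRY.match(stripped)
        if match is None:
            # The first line that is not an Active/Not Active entry ends the section.
            break
        if match.group(2) == "Active":
            reasons.append(match.group(1))
    return reasons


print(active_throttle_reasons(SAMPLE))
# ['HW Slowdown', 'HW Power Brake Slowdown']
```

On a real machine the text could come from subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout instead of SAMPLE.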

Windows under load:

Clocks Throttle Reasons
    Idle                              : Not Active
    Applications Clocks Setting       : Not Active
    SW Power Cap                      : Active
    HW Slowdown                       : Active
        HW Thermal Slowdown           : Not Active
        HW Power Brake Slowdown       : Active
    Sync Boost                        : Not Active
    SW Thermal Slowdown               : Not Active
    Display Clock Setting             : Not Active


Clocks
    Graphics                          : 354 MHz
    SM                                : 354 MHz
    Memory                            : 4996 MHz
    Video                             : 959 MHz
Applications Clocks
    Graphics                          : 1560 MHz
    Memory                            : 7001 MHz
Default Applications Clocks
    Graphics                          : 1560 MHz
    Memory                            : 7001 MHz
Max Clocks
    Graphics                          : 2100 MHz
    SM                                : 2100 MHz
    Memory                            : 7001 MHz
    Video                             : 1950 MHz

Linux under load:

Clocks Throttle Reasons
    Idle                              : Not Active
    Applications Clocks Setting       : Not Active
    SW Power Cap                      : Not Active
    HW Slowdown                       : Active
        HW Thermal Slowdown           : Not Active
        HW Power Brake Slowdown       : Active
    Sync Boost                        : Not Active
    SW Thermal Slowdown               : Not Active
    Display Clock Setting             : Not Active

Clocks
    Graphics                          : 390 MHz
    SM                                : 390 MHz
    Memory                            : 7500 MHz
    Video                             : 1140 MHz
Applications Clocks
    Graphics                          : 1560 MHz
    Memory                            : 7001 MHz
Default Applications Clocks
    Graphics                          : 1560 MHz
    Memory                            : 7001 MHz
Max Clocks
    Graphics                          : 2100 MHz
    SM                                : 2100 MHz
    Memory                            : 7001 MHz
    Video                             : 1950 MHz

UPD: I’m a little bit lost… what do these clocks mean: Graphics 300MHz and Graphics 1560MHz at the same time? I have benchmarked my neural network on both OSes and the speed is roughly the same.

Seems Windows is hit by the same issue. Please downgrade the BIOS, check if PWRBRK gets disabled, then file a bug report with MSI against their beta BIOS.

Clocks
Graphics : 390 MHz
means the current clock is 390 MHz.

Applications Clocks
Graphics : 1560 MHz
means the (adjustable) limit when running CUDA jobs.
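
To make the difference between the two "Graphics" values concrete, here is a rough Python sketch that splits nvidia-smi -q style text into its clock sections and prints the current clock next to the application clock limit. The names parse_clock_sections and CLOCK_SECTIONS are hypothetical helpers, and SAMPLE is just the Linux figures from this thread pasted in as text.

```python
# Section headers of interest, as they appear in the output posted in this thread.
CLOCK_SECTIONS = {
    "Clocks",
    "Applications Clocks",
    "Default Applications Clocks",
    "Max Clocks",
}

# Linux figures from this thread, pasted as plain text (not live data).
SAMPLE = """\
Clocks
Graphics : 390 MHz
SM : 390 MHz
Memory : 7500 MHz
Video : 1140 MHz
Applications Clocks
Graphics : 1560 MHz
Memory : 7001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
"""


def parse_clock_sections(smi_text):
    """Map each clock section name to a {domain: MHz} dictionary."""
    sections = {}
    current = None
    for line in smi_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if ":" not in stripped:
            # A line without a colon is treated as a section header.
            if stripped in CLOCK_SECTIONS:
                current = sections.setdefault(stripped, {})
            else:
                current = None
            continue
        if current is None:
            continue
        name, _, value = stripped.partition(":")
        value = value.strip()
        if value.endswith("MHz"):
            current[name.strip()] = int(value.split()[0])
    return sections


sections = parse_clock_sections(SAMPLE)
print("current graphics clock:", sections["Clocks"]["Graphics"], "MHz")
print("application clock limit:", sections["Applications Clocks"]["Graphics"], "MHz")
# current graphics clock: 390 MHz
# application clock limit: 1560 MHz
```

So "Clocks" is what the GPU is doing right now, while "Applications Clocks" is only the configured limit; with the power brake active, the first stays far below the second.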

So I have tried older BIOSes, with no luck. Moving the card from the PCIe x16 slot to the PCIe x8 slot solved the issue, but now it runs at x8 :/ I will try to disable pin 30 and put the card back into the x16 slot. The only question left: how do I locate the right pin 30? On which side, or both?

Anyway, thanks a lot for the help!

As I said, rather contact MSI; I don’t think they activated this on purpose.
On taping the pin: maybe ask the people in the thread I linked.
