GPU is throttled: NV_GPU_PERF_DECREASE_REASON_API_TRIGGERED

Hi, we have a CUDA-based data-processing framework. Recently we have been observing weak performance, probably caused by throttling. After some investigation we found (using NVML and NVAPI) that NvAPI_GPU_GetPerfDecreaseInfo returns NV_GPU_PERF_DECREASE_REASON_API_TRIGGERED, and NVML reports nvmlClocksThrottleReasonSwPowerCap.
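For anyone who wants to reproduce the check, here is a minimal sketch of how the NVML throttle reasons can be read (device index 0 assumed, error handling abbreviated; the NVAPI query is analogous):

```c
// Minimal sketch: query the current clock throttle reasons via NVML.
// Link against the NVML library; error handling kept brief.
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned long long reasons = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons) == NVML_SUCCESS) {
        if (reasons & nvmlClocksThrottleReasonSwPowerCap)
            printf("Throttled: SW power cap\n");
        if (reasons & nvmlClocksThrottleReasonHwSlowdown)
            printf("Throttled: HW slowdown\n");
        if (reasons & nvmlClocksThrottleReasonSwThermalSlowdown)
            printf("Throttled: SW thermal slowdown\n");
        if (reasons == nvmlClocksThrottleReasonNone)
            printf("No throttling reported\n");
    }
    nvmlShutdown();
    return 0;
}
```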

Who is responsible for setting this? Any suggestions on how to investigate further? Any way to release the throttling?

We have already ruled out some potential problems:

  • The driver is set to maximum performance.
  • The GPU temperature (around 60 °C) is OK.
  • The PSU (1050 W) is more than sufficient.

Hardware: RTX 2080 Super
Driver Version: 461.92 Studio driver
CUDA: 11.2

nvidia-smi allows a privileged user to set a power cap on certain GPUs. I don’t think this is available on an RTX 2080 GPU, but if such a cap has been set, it will limit the power the GPU is allowed to draw.

Otherwise all GPUs have a “maximum” power cap that prevents the GPU from exceeding its power draw design point. This is the meaning of the PowerCap indication. With respect to the maximum power cap, there is nothing you can do about that. This is a self-protection system for the GPU. You cannot work around it or disable it.

If you do nvidia-smi -a you should get a long list of data, and the section “Power Readings” or similar gives relevant data for this topic.

In the output of nvidia-smi -q, look for the two items Enforced Power Limit and Max Power Limit. If the former is equal to the latter, there is nothing further you can do. If the enforced limit is below the maximum limit, use nvidia-smi to set the enforced power limit to the maximum.
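The same numbers can also be read programmatically. Below is a sketch using NVML, assuming device index 0 and omitting most error handling; raising the limit itself would be done with nvmlDeviceSetPowerManagementLimit or nvidia-smi -pl <watts>, both of which require elevated privileges:

```c
// Sketch: read current power draw, the enforced power limit, and the
// min/max power management limits via NVML (all values in milliwatts).
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int powerMw = 0, enforcedMw = 0, minMw = 0, maxMw = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlDeviceGetPowerUsage(dev, &powerMw);
        nvmlDeviceGetEnforcedPowerLimit(dev, &enforcedMw);
        nvmlDeviceGetPowerManagementLimitConstraints(dev, &minMw, &maxMw);
        printf("Power draw:     %.1f W\n", powerMw / 1000.0);
        printf("Enforced limit: %.1f W\n", enforcedMw / 1000.0);
        printf("Limit range:    %.1f - %.1f W\n", minMw / 1000.0, maxMw / 1000.0);
    }
    nvmlShutdown();
    return 0;
}
```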

If your cooling lets the GPU operate at 60 deg C under load, it is cooling the GPU very well, and the impact of hot operation on power consumption should be minimal (for comparison, my GPUs typically operate at 83 to 85 deg C when continuous full load is applied).

The performance loss resulting from power throttling may be fairly small, as the memory clock is typically not reduced, and many applications are limited more by memory than by core performance. You may want to characterize the application’s performance with a roofline model, and then estimate how much performance is lost based on the reduction in core frequency due to power throttling (power throttling works by reducing both voltage and core frequency; there is a roughly linear relationship between voltage and operating frequency).
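As a rough illustration of that kind of estimate (a sketch only; all numbers are placeholders, not measured values for an RTX 2080 Super):

```c
// Back-of-the-envelope roofline estimate: attainable throughput is the
// minimum of the compute roof (scaled by the throttled clock) and the
// memory roof. All numbers below are illustrative placeholders.
#include <stdio.h>

int main(void)
{
    double peak_gflops_at_boost = 11000.0; /* compute roof at full boost clock */
    double mem_bandwidth_gbs    = 496.0;   /* memory roof (GB/s)               */
    double boost_mhz            = 1965.0;  /* unthrottled boost clock          */
    double throttled_mhz        = 1700.0;  /* observed clock under power cap   */
    double arith_intensity      = 4.0;     /* FLOP per byte of the kernel      */

    double compute_roof = peak_gflops_at_boost * (throttled_mhz / boost_mhz);
    double memory_roof  = mem_bandwidth_gbs * arith_intensity;
    double attainable   = compute_roof < memory_roof ? compute_roof : memory_roof;

    printf("Attainable: %.0f GFLOP/s (%s-bound)\n", attainable,
           compute_roof < memory_roof ? "compute" : "memory");
    return 0;
}
```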

The thing to keep in mind is that the sophisticated dynamic clocking schemes of modern GPUs and CPUs make it possible to exploit the power and thermal envelopes fully, giving bonus performance beyond the guaranteed baseline performance. For example, your GPU may be specified for operation at ~1600 MHz baseline, but the maximum boost clock may be ~2000 MHz, giving up to ~25% potential upside. It is not, however, reasonable to assume that any particular workload on any particular physical GPU should be able to run at the maximum boost clocks for an extended period of time.
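One way to see how much of that boost headroom is actually being used is to compare the current SM clock against the rated maximum, e.g. via NVML (a sketch, device index 0 assumed; a given workload may legitimately never reach the rated maximum):

```c
// Sketch: compare the current SM clock against the maximum rated clock
// to see how far below the boost limit the GPU is running.
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int curSm = 0, maxSm = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &curSm) == NVML_SUCCESS &&
        nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &maxSm) == NVML_SUCCESS) {
        printf("SM clock: %u MHz (max %u MHz, %.0f%% of max)\n",
               curSm, maxSm, 100.0 * curSm / maxSm);
    }
    nvmlShutdown();
    return 0;
}
```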

[Later:]

I have a TU104-based GPU here with a maximum boost clock of 1935 MHz at 1.025 V. The RTX 2080 Super is basically a full-featured TU104. Published information on the internet indicates that it reaches a maximum boost clock of 1965 MHz at 1.050 V. You might want to compare that with the clock speed of your RTX 2080 Super under conditions of power throttling. The RTX 2080 Super is at the bleeding edge of what GDDR6 memory supports, so it would seem more likely to be performance limited by memory throughput than some other TU104-based GPU models.

Thanks for your answers.

We found the problem, and I’d like to give a brief explanation in case anyone stumbles into the same behaviour.

The machine was running on an older driver than intended: a 457.xx driver with CUDA 11.1 support, while the software was already built against CUDA 11.2. Unexpectedly, this mismatch did not cause an explicit failure, but implicitly led to SW throttling (nvmlClocksThrottleReasonSwPowerCap). After updating the NVIDIA driver the throttling disappeared.
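For reference, a quick way to spot this kind of mismatch is to compare the CUDA version supported by the driver with the runtime version the application was built against (a sketch using the CUDA runtime API; CUDA 11.x minor-version compatibility is presumably why the mismatch did not fail outright):

```c
// Sketch: report the CUDA version supported by the driver and the runtime
// version the application was built against, to spot a mismatch.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVer = 0, runtimeVer = 0;

    cudaDriverGetVersion(&driverVer);   /* e.g. 11010 for a CUDA 11.1 driver  */
    cudaRuntimeGetVersion(&runtimeVer); /* e.g. 11020 for the CUDA 11.2 runtime */

    printf("Driver supports CUDA %d.%d, application built against CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 1000) / 10,
           runtimeVer / 1000, (runtimeVer % 1000) / 10);

    if (runtimeVer > driverVer)
        printf("Warning: runtime is newer than the driver supports\n");
    return 0;
}
```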

[edit]
The NV_GPU_PERF_DECREASE_REASON_API_TRIGGERED flag is still returned, but we do not see any performance reduction.

Thanks for the feedback. That’s an interesting scenario that I have never seen before. I don’t have a mental model of how such a version mismatch would lead to this kind of throttling.