Hi, we have a CUDA-based data-processing framework. Recently we have been observing weak performance, probably caused by throttling. After some investigation (using NVML and NVAPI) we found that NvAPI_GPU_GetPerfDecreaseInfo returns NV_GPU_PERF_DECREASE_REASON_API_TRIGGERED, and NVML reports nvmlClocksThrottleReasonSwPowerCap.

Who is responsible for setting this? Any suggestions on how to investigate further? Any way to release the throttling?

We have already ruled out some potential problems:

  • Driver is set to maximum performance.
  • GPU temperature (around 60 °C) is OK.
  • PSU (1050 W) is more than sufficient.

Hardware: RTX 2080 Super
Driver version: 461.92 Studio driver
CUDA 11.2
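For anyone debugging the same symptom: the value returned by NVML is a bitmask, and more than one throttle reason can be active at once. The sketch below decodes it; the constant values are re-declared locally (copied from nvml.h as I understand them) so the snippet runs even without pynvml installed — treat them as an assumption and check against your header.

```python
# Decode an NVML clocksThrottleReasons bitmask into human-readable names.
# Bit values re-declared here (from nvml.h) so the snippet is self-contained;
# verify against your NVML headers.
NVML_THROTTLE_REASONS = {
    0x0000000000000001: "GpuIdle",
    0x0000000000000002: "ApplicationsClocksSetting",
    0x0000000000000004: "SwPowerCap",
    0x0000000000000008: "HwSlowdown",
    0x0000000000000010: "SyncBoost",
    0x0000000000000020: "SwThermalSlowdown",
    0x0000000000000040: "HwThermalSlowdown",
    0x0000000000000080: "HwPowerBrakeSlowdown",
    0x0000000000000100: "DisplayClockSetting",
}

def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all throttle-reason bits set in `mask`."""
    return [name for bit, name in NVML_THROTTLE_REASONS.items() if mask & bit]

# On a live system the mask comes from NVML, e.g. via pynvml:
#   import pynvml
#   pynvml.nvmlInit()
#   handle = pynvml.nvmlDeviceGetHandleByIndex(0)
#   mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
print(decode_throttle_reasons(0x4))  # -> ['SwPowerCap']
```

If you see SwPowerCap together with HwSlowdown or a thermal bit, that points at a different root cause than a pure software power cap.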

nvidia-smi allows a privileged user to set a power cap on certain GPUs. I don’t think this is available on an RTX 2080, but if you have it set, it will limit the power the GPU is allowed to draw.

Otherwise all GPUs have a “maximum” power cap that prevents the GPU from exceeding its power draw design point. This is the meaning of the PowerCap indication. With respect to the maximum power cap, there is nothing you can do about that. This is a self-protection system for the GPU. You cannot work around it or disable it.

If you do nvidia-smi -a you should get a long list of data, and the section “Power Readings” or similar gives relevant data for this topic.

In the output of nvidia-smi -q, look for the two items Enforced Power Limit and Max Power Limit. If the former is equal to the latter, there is nothing further you can do. If the enforced limit is below the maximum limit, use nvidia-smi to set the enforced power limit to the maximum.
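As a quick sketch of that check, here is a small parser for the relevant fields of the `nvidia-smi -q -d POWER` output. The sample text and wattage values are purely illustrative, not taken from the poster's machine:

```python
import re

# Illustrative excerpt of `nvidia-smi -q -d POWER` output (made-up values).
SAMPLE = """\
    Power Readings
        Power Management              : Supported
        Power Draw                    : 215.37 W
        Power Limit                   : 215.00 W
        Default Power Limit           : 250.00 W
        Enforced Power Limit          : 215.00 W
        Min Power Limit               : 105.00 W
        Max Power Limit               : 292.00 W
"""

def power_limit(text: str, label: str) -> float:
    """Extract a power limit (in watts) by its field label."""
    m = re.search(rf"{label}\s*:\s*([\d.]+) W", text)
    if m is None:
        raise ValueError(f"no field named {label!r}")
    return float(m.group(1))

enforced = power_limit(SAMPLE, "Enforced Power Limit")
maximum = power_limit(SAMPLE, "Max Power Limit")
if enforced < maximum:
    # Raising the cap requires privileges: sudo nvidia-smi -pl <watts>
    print(f"cap can be raised from {enforced:.0f} W to {maximum:.0f} W")
```

Raising the enforced limit is done with `nvidia-smi -pl <watts>`, which only accepts values between the min and max limits reported above.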

If your cooling lets the GPU operate at 60 deg C under load, it is cooling the GPU very well, and the impact of hot operation on power consumption should be minimal (for comparison, my GPUs typically operate at 83 to 85 deg C when continuous full load is applied).

The performance loss resulting from power throttling may be fairly small, as the memory clock is typically not reduced, and many applications are limited more by memory than by core performance. You may want to characterize the application’s performance with a roofline model, and then estimate how much performance is lost based on the reduction in core frequency due to power throttling (power throttling works by reducing both voltage and core frequency; there is a roughly linear relationship between voltage and operating frequency).
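To make that estimate concrete, here is a back-of-envelope model (my own simplification, not an NVIDIA formula): assume the memory-bound fraction of the runtime is unaffected by the power cap, while the compute-bound remainder scales inversely with core frequency. All numbers below are illustrative.

```python
def throttling_slowdown(f_throttled_mhz: float,
                        f_boost_mhz: float,
                        memory_bound_fraction: float) -> float:
    """Rough estimate of runtime growth caused by core-clock throttling.

    Model assumption: memory clocks are not reduced by the power cap, so
    only the compute-bound share of the runtime stretches with 1/frequency.
    """
    compute_fraction = 1.0 - memory_bound_fraction
    scale = f_boost_mhz / f_throttled_mhz
    return memory_bound_fraction + compute_fraction * scale

# Illustrative: core throttled from 1965 MHz to 1700 MHz on a workload
# that is memory-bound for 70% of its runtime.
slow = throttling_slowdown(1700, 1965, 0.7)
print(f"runtime grows by about {100 * (slow - 1):.1f}%")
```

With these example numbers, a ~13% clock reduction costs only about 5% runtime, which illustrates why power throttling can be fairly benign on memory-bound workloads.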

The thing to keep in mind is that the sophisticated dynamic clocking schemes of modern GPUs and CPUs have allowed exploiting the power and thermal envelopes fully, giving bonus performance over the guaranteed baseline performance. For example, your GPU may be specified for operation at ~1600 MHz baseline, but the maximum boost clock may be ~2000 MHz, giving up to 25% potential upside. It is not, however, reasonable to assume that any particular workload on any particular physical GPU should be able to run at the maximum boost clocks for an extended period of time.


I have a TU104-based GPU here with a maximum boost clock of 1935 MHz at 1.025V. The RTX 2080 Super is basically a full-featured TU104. Published information on the internet indicates that it reaches a maximum boost clock of 1965 MHz at 1.050V. You might want to compare the clock speed of your RTX 2080 Super under conditions of power throttling. The RTX 2080 Super is at the bleeding edge of what GDDR6 memory supports, so it would seem to be more likely to be performance limited by memory throughput than some other TU104-based GPU models.

Thanks for your answers.

We found the problem, and I’d like to give a brief explanation in case anyone stumbles upon the same behaviour.

The machine was running an older driver than it was supposed to: a 457.xx driver with CUDA 11.1 support, while the software was built against CUDA 11.2. Unexpectedly, this mismatch did not cause an explicit failure, but implicitly led to SW throttling (nvmlClocksThrottleReasonSwPowerCap). After updating the NVIDIA driver, the throttling disappeared.

The NV_GPU_PERF_DECREASE_REASON_API_TRIGGERED flag is still returned, but we do not see any performance reduction.

Thanks for the feedback. That’s an interesting scenario that I have never seen before. I don’t have a mental model for how such a version mismatch would lead to this kind of throttling.