Power throttling observed with GPU stress test that calls cuBLAS

Team, I'm curious to know why the clock frequency drops (see snapshot below) upon running this gpu_burn test (GitHub - wilicc/gpu-burn: Multi-GPU CUDA stress test), which essentially issues cublasSgemm_v2 calls back to back.

Upon running this test I find nvidia-smi reporting SW Power Cap as the clock throttle reason. How are we guaranteed to reach the machine's peak TFLOPS if such behaviour is observed even with an optimized SGEMM library call?

You can see the pclk frequency drop to less than 1000 MHz once I start the gpu_burn application.
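For reference, the core of the workload boils down to repeated SGEMM calls on large matrices, roughly like the sketch below (a simplified illustration, not the actual gpu-burn code; matrix size and iteration count are made up):

```cpp
// Simplified sketch of a gpu_burn-style workload: back-to-back SGEMMs on
// large square matrices. Build with nvcc and link -lcublas. Matrix size and
// iteration count are illustrative only; error checking omitted.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 8192;           // illustrative matrix dimension
    const int iterations = 1000;  // illustrative iteration count
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)N * N * sizeof(float));
    cudaMalloc(&dB, (size_t)N * N * sizeof(float));
    cudaMalloc(&dC, (size_t)N * N * sizeof(float));
    cudaMemset(dA, 0, (size_t)N * N * sizeof(float));  // contents irrelevant for power draw
    cudaMemset(dB, 0, (size_t)N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    for (int i = 0; i < iterations; ++i) {
        // C = alpha*A*B + beta*C; issued back to back, these calls keep the
        // SMs saturated and push power draw toward the board's power cap.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
    }
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```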

This is an expected observation in some cases. The most obvious example of this that I have seen is on the T4 processor.

There is no such guarantee.

I'm observing this with an A10.

The general effect is observable on many GPUs. Issuing large GEMM calls back to back can push many GPUs into a power-capping situation.

I think “peak” is the operative word here, and TFLOPS specs based on it should be regarded as aspirational. Quoting from the Ampere GA102 whitepaper, in the fine print at the end of the listed specs:

“1. Peak rates are based on GPU Boost Clock.”


The power draw of a GPU is maximized at a particular mix of compute throughput and memory throughput. Large(-ish) matrix multiplies tend to have an activity profile fairly close to this mix. I would generally consider this a Good Thing™ as it basically shows that GEMM makes full use of available GPU resources.

(1) Check with nvidia-smi that the Current Power Limit is identical to the Max Power Limit; otherwise, adjust the power limit with nvidia-smi -pl (see the sketch after this list for doing the same check programmatically).

(2) All other factors being equal, lower operating temperatures achieved by aggressive cooling can lower power draw slightly (think single digit percent). Keeping GPU temperatures below 60°C would be the goal.
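Since nvidia-smi is built on NVML, the check in (1) can also be scripted. A minimal sketch, assuming device index 0 and omitting error checking:

```cpp
// Query current vs. maximum power limit via NVML (the library behind
// nvidia-smi). Link with -lnvidia-ml. NVML reports limits in milliwatts.
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // device index 0 assumed

    unsigned int currentMw = 0, minMw = 0, maxMw = 0;
    nvmlDeviceGetPowerManagementLimit(dev, &currentMw);
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &minMw, &maxMw);

    printf("current power limit: %u W (allowed range %u-%u W)\n",
           currentMw / 1000, minMw / 1000, maxMw / 1000);
    if (currentMw < maxMw)
        printf("limit is below the maximum; consider raising it with nvidia-smi -pl\n");

    nvmlShutdown();
    return 0;
}
```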

What's the threshold for this compute and memory mix? Should we make sure we don't hit this threshold to get better performance without underclocking?

That is different for every GPU model. To give a rough idea from vague memory, about 30% to 40% of the memory bandwidth and as much compute as you can fit into the remaining instructions (that used to mean maximizing use of FP32 FMAs).

You can create a synthetic workload that allows you to balance FP32 and memory throughput and see which specific mix results in the highest power consumption for a particular model. If you build a database of different GPU models you can sell this as a GPU burn-in test :-)
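A minimal sketch of what such a synthetic workload could look like (the kernel name, buffer size, and FMAs-per-load knob are all made up for illustration): each thread streams through a buffer and performs a configurable number of FP32 FMAs per element loaded, so sweeping that knob while watching the power readout shifts the balance between memory and compute throughput.

```cpp
// Illustrative kernel mixing memory traffic with FP32 FMAs. Increasing
// fmasPerLoad shifts the workload from memory-bound toward compute-bound;
// sweep it while watching power draw (e.g. via nvidia-smi) to find the mix
// that maximizes power consumption on a particular GPU model.
#include <cuda_runtime.h>

__global__ void mix_kernel(const float *in, float *out, size_t n, int fmasPerLoad)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float v = in[i];                      // memory throughput component
        float acc = 1.0f;
        for (int k = 0; k < fmasPerLoad; ++k)
            acc = fmaf(acc, v, 0.5f);         // compute throughput component
        out[i] = acc;
    }
}

int main()
{
    const size_t n = 1u << 28;                // 256M floats = 1 GiB per buffer (illustrative)
    const int fmasPerLoad = 32;               // the knob to sweep
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter)   // run long enough to observe steady-state power
        mix_kernel<<<2048, 256>>>(in, out, n, fmasPerLoad);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```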

In general, when you want to maximize performance and the code you are running is only available as a binary executable, the knobs you can turn are: (1) use nvidia-smi to adjust the power limit to the maximum allowed; (2) set application clocks or use clock locking to keep GPU clocks at a constant high level [may not be supported on consumer GPUs]; (3) cool the GPU aggressively, which usually means water cooling.
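If you prefer to script knobs (1) and (2) instead of calling nvidia-smi, NVML exposes the same controls. A rough sketch (requires administrative privileges; the clock value is a placeholder and must be one the GPU actually supports, and clock locking is not available on every GPU):

```cpp
// Sketch: raise the power limit to the board maximum and lock GPU clocks via
// NVML, the programmatic counterpart of nvidia-smi -pl and nvidia-smi -lgc.
// Requires root; link with -lnvidia-ml; error checking mostly omitted.
#include <nvml.h>
#include <cstdio>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);      // device index 0 assumed

    // Knob (1): raise the power limit to the maximum the board allows.
    unsigned int minMw = 0, maxMw = 0;
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &minMw, &maxMw);
    nvmlDeviceSetPowerManagementLimit(dev, maxMw);

    // Knob (2): lock graphics clocks to a fixed level (1500 MHz is a
    // placeholder; not supported on all consumer GPUs).
    nvmlReturn_t rc = nvmlDeviceSetGpuLockedClocks(dev, 1500, 1500);
    if (rc != NVML_SUCCESS)
        printf("clock locking not available: %s\n", nvmlErrorString(rc));

    nvmlShutdown();
    return 0;
}
```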

If you are interested in maximizing efficiency (best performance per watt), you would typically lower the power limit and dial in lower clocks via application clock setting or clock locking. You would need to experiment to find the optimal setting for your workload.

If you have access to source code, increasing the performance of an application using guidance from the CUDA profiler is the best way to go. Because bottlenecks tend to shift during an optimization process, you may need multiple rounds of code changes. You may also need to revisit the process with each new major GPU architecture. It would not hurt to study the CUDA Best Practices Guide for some general guidelines.

I'm interested in max performance per price. So must I invest in advanced cooling or play with the fan speed to get there?

I do not have a good overview of the performance / price situation. It is also dependent on your workload(s): Does it require double precision? Does it need large GPU memory? Can it make use of tensor cores? Maybe some website has done the leg work and provides an overview.

Aggressive cooling is only going to provide an incremental increase in performance, and it likely will worsen the performance / dollar ratio. You may want to check relevant tweaking sites to see what kind of performance gains they observe with what kind of equipment, and the cost of that equipment.

When you look at cost, make sure to distinguish between cap-ex (upfront capital expenditure) and op-ex (ongoing cost of operation). Modern high-end processors, both CPUs and GPUs, are so power thirsty that their operation easily results in a noticeably increased electricity bill, especially when operating in something close to a 24/7 regime. And depending on where you are in the world, op-ex (let's say over a useful hardware lifespan of five years) can outstrip cap-ex when the cost per kWh is high.

As njuffa has covered well, and subject to exact requirements, more than one RTX 4080 could be purchased for the same money as your A10, with better FP32 performance, albeit at higher power consumption.