Consistent performance with RTX 2080

Hello,
I’m using Triton Inference Server to serve deep learning models.
To benchmark my models, I’m trying to get consistent performance out of the GPU. Unfortunately, there is currently a large variance in the inference times.

I’ve noticed that after a short period without inference requests, the GPU drops to performance level P8. While inference requests are running, the GPU goes to performance level P2.

  • Is there initial overhead when moving between performance levels, e.g. from P8 to P2?
  • Is it possible to maintain a high performance level, so the GPU does not drop to P8 when idle?
  • Why won’t the GPU go to the P0 performance level?
  • In general, what settings should I tweak, and what could be the reasons for inconsistent inference times?

The P2/P8 transition will certainly have some effect. I would normally recommend setting persistence mode, but I don’t know whether that can be set on an RTX 2080. It’s a common observation that consumer-grade GPUs may not go to the P0 performance level in cases where you might expect it. With a bit of searching you may find suggestions for tweaking things via the control panel; I don’t know those tweaks offhand.
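
If you want to experiment with persistence mode, here is a minimal sketch of checking and enabling it from Python (Linux only, the enable step needs root; whether a GeForce RTX 2080 accepts it I can’t say, so treat it as an illustration):

```python
# Sketch: query and, if desired, enable persistence mode via nvidia-smi.
# Linux only; enabling requires root. Not guaranteed to help on a GeForce card.
import subprocess

def get_persistence_mode(gpu_index: int = 0) -> str:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=persistence_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()  # "Enabled" or "Disabled"

def enable_persistence_mode(gpu_index: int = 0) -> None:
    # Equivalent to: sudo nvidia-smi -i <index> -pm 1
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pm", "1"], check=True)

if __name__ == "__main__":
    print("persistence mode:", get_persistence_mode(0))
```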

The clocks at which various parts of the card operate change dynamically based on load, temperature, power supply, etc. Idle or lightly loaded components will clock down, including the PCIe interface. This is all designed to wring maximum performance from the silicon, given that performance increases between silicon process generations have become small (a.k.a. Moore’s Law is dead).
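
One way to see these transitions for yourself is to sample the performance state, clocks, and temperature while your benchmark runs. Here is a minimal sketch using the NVML Python bindings (the nvidia-ml-py / pynvml package); the one-second sampling interval and the particular fields queried are just my choices:

```python
# Sketch: sample P-state, SM/memory clocks, and temperature once per second,
# using the NVML Python bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(30):  # roughly 30 seconds of samples
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)  # 0 = P0, 2 = P2, 8 = P8
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        mem_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"P{pstate}  SM {sm_mhz} MHz  MEM {mem_mhz} MHz  {temp_c} C")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```

Correlating those samples with your per-request latencies should tell you whether the P8/P2 transitions line up with the slow requests.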

To my knowledge, there are no user-controllable knobs for either GPU power-state management or boost clock selection, but operating at lower temperatures often allows higher boost clocks.

On Teslas there used to be the ability to set fixed application clocks. Not sure whether that capability still exists. The basic idea was to ensure that all GPUs in a cluster can be forced to run at the same speed. If I recall correctly, application clocks have never been applicable to consumer cards like the RTX 2080.
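
For completeness, on cards that do support application clocks, setting them through NVML looks roughly like the sketch below. It needs root, and on a card without the feature the calls simply fail with a not-supported error, so this is an illustration rather than something I expect to work on an RTX 2080:

```python
# Sketch: query supported application clocks and pin them to the highest
# supported pair (Tesla-class cards; needs root). GeForce cards typically
# report "not supported".
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    mem_mhz = max(pynvml.nvmlDeviceGetSupportedMemoryClocks(handle))
    sm_mhz = max(pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_mhz))
    pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_mhz, sm_mhz)
    print(f"application clocks pinned to MEM {mem_mhz} MHz / SM {sm_mhz} MHz")
except pynvml.NVMLError as err:
    print("application clocks not supported or not permitted:", err)
finally:
    pynvml.nvmlShutdown()
```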

Not sure what you mean by “overhead”. Do you mean a delay from when relevant input parameters change to when the power state switches? By observation, there is some amount of hysteresis in the state switching. The details of the state management are not publicly disclosed and may change with GPU BIOS and driver version for all I know. One could guess that the state management uses some sort of PID control to smooth out state transitions.

I am not sure why the power management for your card does not get into P0 state, or whether P0 state is even supported for your card. I have two Quadros here, one of which is a Quadro RTX 4000 (so also based on Turing architecture), and under continuous heavy load both currently operate at P0, with one running at 79 deg C and the other running at 85 deg C.

In general it is not possible to get exactly repeatable performance out of a modern GPU unless you can replicate all environmental factors exactly, which is generally impossible in practice. That does not mean that all performance variation you observe at the application level is necessarily due to the hardware, in particular not if the application uses random data in any part of its processing.

Benchmarking in the presence of performance fluctuations has been an issue for a long time. One common strategy is to report the fastest of N identically configured runs.
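
As an illustration of the best-of-N approach (run_inference below is a placeholder for whatever request you actually send to Triton):

```python
# Sketch: report the fastest of N identically configured runs.
# run_inference() is a placeholder for the real Triton inference request.
import time

def run_inference():
    pass  # issue one inference request here

def best_of(n: int = 10) -> float:
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        run_inference()
        best = min(best, time.perf_counter() - start)
    return best

print(f"fastest of 10 runs: {best_of(10) * 1e3:.3f} ms")
```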

“you may find suggestions for tweaking things via the control panel”

I have never discovered anything relevant to compute performance in the control panel. Maybe there are some undocumented registry keys somewhere that people have reverse engineered, but I am not aware of any.

This is not the only thread, but this is an example of what I had in mind.

In NVIDIA’s documentation, P0 and P2 are described very vaguely as “maximum 3D performance” and “balanced 3D performance and power”. In practical terms, P0 seems to allow somewhat faster memory clocks, but both P0 and P2 seem to equally allow GPU core clock boosting to occur.

Some GPUs are apparently configured so P2 state is the highest state that can be reached while running compute workloads. This would indicate to me that NVIDIA has found that compute workloads cannot run reliably at the boosted memory clocks, and that there is no point in trying to use P0 state. I don’t think this is used to aid market segmentation because the resulting performance differences at app level are quite limited (a 15% boost to memory clock doesn’t result in a 15% boost to app performance).

Where ten (or maybe fifteen) years ago a constant engineering margin of about 20% was maintained on relevant processor metrics to allow reliable operation across a wide range of operating conditions and component ages, today’s CPUs and GPUs exploit most of that margin by boosting clocks and power dynamically in ever more complicated schemes. I wouldn’t be surprised if some are now using deep learning techniques to achieve near optimal power-state switching and clock boosting behavior.

Kind of like the “last hurrah” of current silicon processing technology.

Yes, there are hacker-level tools out there to manipulate power-state management and clock boosting for NVIDIA GPUs. I won’t point anybody to them because there is no way to guarantee reliable GPU operation. I was an avid overclocker in my younger years, and I know that clocks set too high can lead to hard-to-track failures. My favorite example: the square-root instruction of a processor’s FPU – but no other instruction, as best I could tell – would occasionally return a wrong result. It took me several days to track an app-level failure down to that root cause. Lesson learned.