The clocks at which the various parts of the card operate change dynamically based on load, temperature, power supply, etc. Idle or lightly loaded components clock down, including the PCIe interface. This is all designed to wring maximum performance out of the silicon, given that performance gains between silicon process generations have become small (a.k.a. Moore’s Law is dead).
To my knowledge, there are no user-controllable knobs for either GPU power-state management or boost clock selection, but operating at lower temperatures often allows higher boost clocks.
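If you want to observe this behavior yourself, the NVML library that ships with the driver reports the current clocks and temperature. A minimal sketch, assuming device index 0 and linking against -lnvidia-ml:

```c
// Minimal NVML sketch: print current SM/memory clocks and GPU temperature.
// Assumes device index 0; compile e.g. with: gcc query_clocks.c -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int smClk, memClk, temp;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClk);    // current SM clock, MHz
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClk);  // current memory clock, MHz
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
    printf("SM %u MHz, MEM %u MHz, %u deg C\n", smClk, memClk, temp);
    nvmlShutdown();
    return 0;
}
```

Run it while the GPU is busy versus idle and you will see the clocks move. `nvidia-smi -q -d CLOCK` reports the same data.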
On Teslas there used to be the ability to set fixed application clocks via nvidia-smi. I am not sure whether that capability still exists. The basic idea was to ensure that all GPUs in a cluster could be forced to run at the same speed. If I recall correctly, application clocks were never supported on consumer cards like the RTX 2080.
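For reference, application clocks are controlled with `nvidia-smi -ac <mem,gfx>` (reset with `nvidia-smi -rac`) or programmatically through NVML. A rough sketch of the NVML route, assuming a GPU that supports application clocks and sufficient (typically root/admin) privileges:

```c
// Sketch: enumerate supported clock pairs and pin application clocks via NVML.
// Only works on GPUs that support application clocks (Teslas, some Quadros),
// and typically requires elevated privileges.
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int memClks[32], gfxClks[128];
    unsigned int nMem = 32, nGfx = 128;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    if (nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClks) == NVML_SUCCESS) {
        // Query the graphics clocks supported at the first reported memory clock,
        // then pin that pair. In practice you would pick the pair you want.
        nvmlDeviceGetSupportedGraphicsClocks(dev, memClks[0], &nGfx, gfxClks);
        nvmlReturn_t rc = nvmlDeviceSetApplicationsClocks(dev, memClks[0], gfxClks[0]);
        printf("set application clocks: %s\n", nvmlErrorString(rc));
    } else {
        printf("application clocks not supported on this GPU\n");
    }
    nvmlShutdown();
    return 0;
}
```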
I am not sure what you mean by “overhead”. Do you mean the delay between relevant input parameters changing and the power state switching? By observation, there is some amount of hysteresis in the state switching. The details of the state management are not publicly disclosed and may change with GPU BIOS and driver version for all I know. One could guess that the state management uses some sort of PID controller to smooth out state transitions.
I am not sure why the power management for your card does not get into P0 state, or whether P0 state is even supported for your card. I have two Quadros here, one of which is a Quadro RTX 4000 (so also based on the Turing architecture), and under continuous heavy load both currently operate at P0, with one running at 79 deg C and the other at 85 deg C.
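If you want to see which P-state your card actually reaches and how the transitions behave over time, NVML reports the current performance state. A rough polling sketch (device 0 assumed; `usleep` makes this POSIX-specific):

```c
// Sketch: poll the performance state and SM clock ~10x per second so the
// hysteresis in state transitions becomes visible. POSIX usleep() assumed.
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 100; i++) {        // observe for ~10 seconds
        nvmlPstates_t ps;
        unsigned int smClk;
        nvmlDeviceGetPerformanceState(dev, &ps);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClk);
        printf("t=%4.1fs  P%d  SM %u MHz\n", i * 0.1, (int)ps, smClk);
        usleep(100000);                    // 100 ms between samples
    }
    nvmlShutdown();
    return 0;
}
```

Start a heavy kernel partway through the run and watch how quickly the card ramps up, then how long it lingers at the higher state after the load stops.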
In general, it is not possible to get exactly repeatable performance out of a modern GPU unless you can replicate all environmental factors exactly, which is rarely possible in practice. This does not mean that all performance variations you observe at the application level are necessarily due to this, in particular not if the application uses random data in any part of its processing.
Benchmarking in the presence of performance fluctuations has been an issue for a long time. One common strategy is to report the fastest of N identically configured runs.
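A minimal sketch of that strategy using CUDA events; the kernel, run count, and data size are placeholders for your own setup:

```c
// Sketch: report the fastest of N identically configured runs via CUDA events.
// busy_kernel, N_RUNS, and n are placeholders, not a specific workload.
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 1.000001f + 0.5f;
}

int main(void)
{
    const int n = 1 << 24, N_RUNS = 10;
    float *d, best = FLT_MAX;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    busy_kernel<<<(n + 255) / 256, 256>>>(d, n);   // warm-up run, not timed

    for (int r = 0; r < N_RUNS; r++) {
        cudaEventRecord(start);
        busy_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best) best = ms;                  // keep the fastest run
    }
    printf("fastest of %d runs: %.3f ms\n", N_RUNS, best);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```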
“you may find suggestions for tweaking things via the control panel”
I have never discovered anything relevant to compute performance in the control panel. Maybe there are some undocumented registry keys somewhere that people have reverse engineered, but I am not aware of any.