Performance state switches from P0 to P2 when starting program

I have a Python script with which I train an RL agent on Ubuntu 22.04, on a server with two RTX 4090s, of which I only use one. When I start the program, the performance state jumps from P8 to P2. So far so good, but due to long training times, I need P0.

I have CUDA 12.2 and driver 535.113.01.

I have set the performance state manually on the server as follows:

sudo nvidia-xconfig --enable-all-gpus --allow-empty-initial-configuration
sudo DISPLAY=:0 XAUTHORITY=/run/user/128/gdm/Xauthority nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

It sets P0, but as soon as I start the Python script, it jumps to P2. When I stop execution, it switches back to P0.

The RTX 3080 Ti in my notebook reaches P0 automatically and trains faster than the server's RTX 4090 in P2. What can I do to keep P0 when I start the script? I couldn't find anything useful on the web. The only idea I have left is to go for overclocking…


I have never experienced it personally, but based on reports from forum participants that I consider reliable sources, it seems that on various GPUs the highest performance state typically reached is P2 rather than P0. Your observation is therefore not necessarily a red flag or a clear indication that much performance is left on the table.

You have already noticed that trying to “sneakily” bypass the automatic power and clock management of modern GPUs will not necessarily work reliably.

The first thing you would want to check is whether the RTX 4090 is well cooled and well supplied with power. A fairly common problem with multiple-GPU configurations is restricted air flow due to close proximity of the GPUs. Another fairly common issue with multi-GPU configurations is an under-dimensioned power supply. My rule of thumb for a rock-solid setup over a projected hardware life cycle of five years is: The sum of the nominal power of all system components should not be much higher than 60% of the nominal rating of the power supply. So with dual RTX 4090s, you would want a 1600W PSU (I would recommend one with 80PLUS Platinum rating).
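To put rough numbers on that rule of thumb: 60% of 1600 W is 960 W, and two RTX 4090s at their nominal 450 W board power (factory-overclocked cards can be set higher) already account for 900 W of that, so under this rule 1600 W is really the lower bound for a dual-4090 box.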

While applying an intense computational workload, check whether nvidia-smi reports any slowdown events, such as Power Cap or Thermal Slowdown. What is the GPU temperature under a sustained compute workload (check about two minutes after the start of the workload)? For best performance with automatic management it should be 60 deg C or lower, something that may not be achievable with air cooling, e.g. due to high ambient temperatures.
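As a minimal sketch (GPU index 0 assumed; run it while the training script is busy):

nvidia-smi -q -i 0 -d TEMPERATURE,PERFORMANCE
# TEMPERATURE shows the current GPU temperature and the slowdown/shutdown thresholds;
# PERFORMANCE shows the performance state and the Clocks Event Reasons
# ("Clocks Throttle Reasons" on older drivers), where Power Cap or Thermal Slowdown
# would show up as Active.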

In nvidia-smi, check that Current Power Limit is equal to Max Power Limit. If it is lower, increase the power limit to the maximum using nvidia-smi. When applying a power limit increase, make sure the PSU (power supply unit) provides sufficient reserves. In my experience, when raising the power limit to the allowed maximum, the percentage increase in power draw often exceeds the percentage increase in performance (the law of diminishing returns at work).
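A sketch of the relevant commands (the 450 W value below is purely illustrative; use whatever Max Power Limit nvidia-smi reports for your card, and mind the PSU headroom):

nvidia-smi -q -i 0 -d POWER      # shows Current, Default, Min, and Max Power Limit
sudo nvidia-smi -i 0 -pl 450     # raise the enforced power limit (watts, illustrative value)
sudo nvidia-smi -i 0 -pm 1       # optional: persistence mode, so the setting is not lost
                                 # when the driver unloads between jobs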

If it is supported for this GPU (consumer GPUs often have limited functionality in this regard compared to professional GPUs), try setting high application clocks. Alternatively, look into locking clocks, as discussed on the NVIDIA Technical Blog.

It is not clear to me when application clock setting should be preferred to clock locking, or vice versa. Clock locking is the newer mechanism as far as I know.
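In terms of commands, a minimal sketch of both approaches (GPU index 0 assumed; the placeholder values must come from your own SUPPORTED_CLOCKS output):

nvidia-smi -q -i 0 -d SUPPORTED_CLOCKS       # list the valid memory,graphics clock pairs
sudo nvidia-smi -i 0 -ac <memMHz>,<grMHz>    # application clocks (often unsupported on GeForce)
sudo nvidia-smi -i 0 -lgc <minMHz>,<maxMHz>  # lock the graphics clock range (newer mechanism)
sudo nvidia-smi -i 0 -rac                    # reset application clocks
sudo nvidia-smi -i 0 -rgc                    # reset the locked graphics clocks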

Indeed. I spent an hour yesterday testing on a friend’s 4080 and had an nvidia-smi dmon running throughout.
With an intense load and default clock settings, the power was being limited to 300W and the clock was running at around 2760 MHz.

When I started using Nsight Compute, which locks the clock to the base frequency of 2200 MHz, the power dropped to 202W. So a roughly 50% increase in power buys about a 23% gain in performance.
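For anyone who wants to watch the same numbers live, the monitoring boils down to something like this (selectors are the standard dmon ones; adjust the device index as needed):

nvidia-smi dmon -i 0 -s puc -d 1
# p: power draw and temperature, u: GPU/memory utilization, c: SM and memory clocks;
# -d 1 samples once per second, Ctrl+C stops it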

Thank you for your useful tips! Regarding the power supply, I have asked the IT staff (who actually set up the server) whether there is enough power, so I am waiting for an answer there.

Under automatic power management, the temperature stays around 40 deg C, power consumption is about 84 W, and GPU utilisation is 10-15% while the script runs. The fan stays at 0% the whole time.

However, I have set the current power limit to the maximum. I checked the other attributes and compared my notebook RTX 3080 Ti (left) with the server RTX 4090 (right); see the attached image. The temperature limits of the 4090 card are a bit weird. Could this be the reason for a Thermal Slowdown? I am not an expert and didn't want to risk modifying anything… I don't know whether (and how) I should change these temperature limits.

As you said, the thermal limits shown for the RTX 4090 (right image) look bizarre. I have never seen anything like this. Were these GPUs purchased as factory-fresh from a vendor with a good reputation in your locale? Did anyone try to manipulate (“hack”) the VBIOS? Are you running with the latest available drivers?

I did not realize that the Max Power Limit for an RTX 4090 is 530W! Even with a 1600W power supply in the system you may be pushing your luck running two of these at this setting.

You would want to examine the Clocks Event Reasons (they should directly follow the Performance State in the output from nvidia-smi -q) to see whether any are active.
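For example, something along these lines logs them every few seconds while your script runs (the query property names still use the older "throttle" wording):

nvidia-smi -i 0 -l 5 --format=csv \
  --query-gpu=pstate,clocks.sm,clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown
# clocks_throttle_reasons.active is a bitmask; the individual reasons print Active / Not Active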

Well, it is a Lambda Labs Vector workstation with MSI Suprim X graphics cards. And there are no Clocks Event Reasons active… It is really very strange. But I have now asked the Lambda team if they know what to do.

Btw: there is a 3500 W power supply for this server. So I think the problem is not there.

Anyhow, I would be thankful for any suggestions on how to solve this…

I was just running through possible scenarios I have encountered before. You stated that you have two RTX 4090 in this system. Do both show these whacky temperature limits in nvidia-smi? If you ever figure out the root cause of that, please let us know here for future reference.

As for your original question regarding P2 being the highest performance state reached under a compute workload: this may be normal, based on previous discussions in this forum. The internal workings of the power management and clock control are not publicly documented by NVIDIA, and it therefore seems unlikely that someone from NVIDIA will comment on it here.

What is important from a performance perspective is what clocks are reached, and in that regard there seemed to be no red flag in the nvidia-smi output shown earlier. Did you try setting application clocks or locking clocks, as I suggested earlier?

Yes, unfortunately the second RTX 4090 shows exactly the same strange temperature limits… I am currently in contact with Lambda Labs support to resolve this. I will share insights and solutions here once I have something useful.

I tried setting application clocks with nvidia-smi -ac, but this gives the warning "Setting applications clocks is not supported".
I have then locked the graphics and memory clocks with -lgc and -lmc to their respective maximums. The memory clock attains its maximum, but the graphics clock does not (2790 MHz reached out of the 3160 MHz maximum).
The performance is really only a little bit better, nothing remarkable. My only hope is that the people at Lambda Labs can come up with a solution…

I would think that 3160 MHz is the maximum boost clock for this GPU. In my experience, the maximum boost clock is usually not sustainable with any GPU for more than a burst of a few seconds due to environmental factors monitored by the GPU (temperature, power draw, and voltage*). Mostly I see GPUs “maxing out” around 90% to 92% of the maximum theoretical boost clock. With water-block cooling that keeps GPU temperature below 40°C at all times it may be possible to permanently operate closer to maximum boost clock, but I do not have practical hands-on experience with that.

(*) When GPUs run at high clocks, their supply voltage needs to increase as part of the dynamic management process, to keep the chip running stably. But as for all electronics there is a voltage limit to prevent hardware damage, which is around 1.1V for modern semiconductor manufacturing processes. Conversely, there is also a low voltage limit, typically around 0.7V.


Well, I think in my case there might be some further problem, not only clock speeds. The weird thing is still the temperature settings, which absolutely do not make sense. Not even Lambda Labs, the vendor, has come up with a solution yet. It would be very helpful if an NVIDIA engineer could have a look at it…

Since Lambda Labs is an NVIDIA-approved integrator, they should have a support contact at NVIDIA, and this kind of hardware-related issue (wacky temperature limits) seems like the kind of issue they would want to take up with that contact. Note: I have no knowledge of how NVIDIA structures the relationship with their integrators, but this seems like the correct path to move things forward on that issue.


Well let’s wait and hope for their support and solution. I will keep this discussion here updated.

P0 means "Maximum 3D performance"; I'm guessing that CUDA workloads can't use P0.

After a very long time, I got some answers about the strange temperature limits. According to the provider, these values are fixed and appear to have no direct meaning; all their other graphics cards show the same values.
I will just accept it as it is and hope to get maximum performance when running intensive deep learning models…

Have you tried setting nvidia-smi --cuda-clocks=1 ?


No, but eventually I’ll try it out. Thank you!

I had the exact same issue; it was solved by following this post: Increase Performance with GPU Boost and K80 Autoboost | NVIDIA Technical Blog

Basically, identify the maximum allowed clocks and set them with sudo nvidia-smi -ac 3004,875 -i 0

Hopefully that will fix it for you as well!
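
In case it is useful, a sketch of how to look up the maximum allowed clocks for your own GPU first (standard nvidia-smi queries; substitute the values it reports into -ac):

nvidia-smi -i 0 --query-gpu=clocks.max.mem,clocks.max.sm --format=csv
nvidia-smi -q -i 0 -d SUPPORTED_CLOCKS       # full list of valid memory,graphics pairs
sudo nvidia-smi -ac <memMHz>,<grMHz> -i 0    # then set your own pair here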