NVIDIA GPU 3090 performance mode setting

Hello,

I would like to know how I can keep my RTX 3090 locked in the P0 performance state whenever I run my program.

Also, I do not understand why it falls into P0 mode and draws a lot of power even when no program is running.

I have also noticed a strange problem: I ran the same CUDA code, with the same script, on the same RTX 3090 about half a year after a previous run, and the results are now on average 1/3 slower than before.

I don’t know what causes this. I suspect something has become unstable in the 3090 hardware. What are the possible reasons for that?

In my experience it’s generally not possible with GeForce products. The GPU driver decides what performance state to put the GPU in according to its own heuristics.

For enterprise/data center products, you can “lock” the performance behavior to some degree following instructions like this.
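On cards that support it, locking typically looks something like the sketch below. The clock value is a placeholder, the commands require root, and most GeForce cards will reject the lock:

```shell
# Sketch: pin GPU clocks on a data center / professional card.
sudo nvidia-smi -pm 1                          # persistence mode (Linux)
sudo nvidia-smi --lock-gpu-clocks=1400,1400    # placeholder clock range in MHz
# ... run the workload ...
sudo nvidia-smi --reset-gpu-clocks             # restore default behavior
```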

There are a few other threads on these forums with related questions/information. Here are a few:

1 2 3. There are others as well that you can find with a bit of searching.


I agree that the combination of no running process reported, 5 MB GPU memory usage, the fan running at 32%, and a temperature of 63 degrees Celsius does not really jibe with the substantial power draw of 141 W.

If the card were affected by the common issue of accumulated dust clogging the heat sink fins, I would expect to see higher temperature and fan percentage as a result, so that does not seem to apply. FWIW, I clean my GPUs with compressed air (from a can!) once a year.

However, the measurements from various GPU sensors are not instantaneous, so it seems possible that a single snapshot captured a seemingly inconsistent state, e.g. right after a GPU-accelerated program has finished running. I would suggest trying continuous monitoring with the free GPU-Z tool from TechPowerUp to see if some pattern can be glimpsed from that.
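On Linux, continuous monitoring can also be done with nvidia-smi itself, e.g. (a sketch; the 1-second interval is arbitrary):

```shell
# Log P-state, power draw, temperature, and SM clock once per second.
nvidia-smi --query-gpu=timestamp,pstate,power.draw,temperature.gpu,clocks.sm \
           --format=csv -l 1
```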

I have made the observation that NVIDIA’s monitoring software underlying the nvidia-smi tool sometimes starts malfunctioning on Windows after a longer system uptime (usually > 1 month), requiring a system reboot to restore proper operation.

It is possible that a sensor on the GPU or the I2C controller involved in collecting that data has become defective, although I have never personally observed such a failure, and I have no idea how one would test such a hypothesis conclusively. Since the sensor data drives decisions about power states and clock boosting made by NVIDIA’s driver, this then could explain your performance observations.

That should not be happening. Apart from a hardware fault, the only explanation would be that someone has set a high fixed value for the GPU clocks.

Does it change if you run:

nvidia-smi -rac

which should reset the GPU application clocks to default.
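To check whether the reset had any effect, one could compare the reported performance state and clocks before and after (field names are listed by nvidia-smi --help-query-gpu):

```shell
# Snapshot the state, reset application clocks, and snapshot again.
nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem,power.draw --format=csv
sudo nvidia-smi -rac    # reset application clocks to default
nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem,power.draw --format=csv
```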

Thank you.

Actually, these are the results after I have already rebooted the system.

I tried nvidia-smi -rac.
Does this mean this command needs sudo permission?

Can I do this with sudo? “nvidia-smi -pm 1”

Many of the clock setting commands require administrative privileges. Also, some clock setting commands are not supported on consumer cards like the RTX 3090. The details of both of these restrictions have differed over time.

nvidia-smi -pm 1 turns on persistence mode. You would always want to do this under Linux. It is not needed under Windows. It is shown as Off in your snapshot of nvidia-smi output.

When I spoke of system reboot earlier, I actually meant power cycling the system. Sorry for being imprecise. Then, install the latest driver package (the nvidia-smi output displayed above indicates an older driver).

With administrative privileges secured (e.g. sudo), turn on persistence mode. Then try to clear old clock settings with nvidia-smi, in particular --reset-applications-clocks and --reset-gpu-clocks.
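The suggested sequence could look like this (a sketch; -i 0 assumes the 3090 is GPU index 0):

```shell
sudo nvidia-smi -i 0 -pm 1                         # persistence mode on (Linux)
sudo nvidia-smi -i 0 --reset-applications-clocks   # long form of -rac
sudo nvidia-smi -i 0 --reset-gpu-clocks            # long form of -rgc
```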

If you have a second RTX 3090 in a different system, you could try swapping the GPUs between the two systems. If the problem follows the GPU, that is a pretty good indication that the hardware is not functioning correctly.

Ok. Thank you very much.

Should we always install the latest driver? Actually, I think my driver version 545 is already quite new, although not the latest.

The problem is that half a year ago, with an even older driver, my code ran faster on the same machine. Now it is slower, and that feels weird to me.

Most certainly not, IMHO. However, when one encounters unusual problems without ready explanations what might be causing them, that is very strong motivation to switch to the latest available driver package. As long as a system is working fine, and none of the features introduced in newer drivers are needed, one can stay with an older driver for many months, even years.

None of us know what your exact machine configuration looked like half a year ago, and I would boldly claim that even you would be hard pressed to replicate it exactly. So it is best to move forward.

The command does need root permissions, but the response you show indicates that that particular card does not support changing application clocks. Generally, GeForce-class cards do not have this capability, but I wasn’t sure, the 3090 being the top card in the range.


So it seems that the 3090 does not permit setting application clocks?

The “system reboot” means: I use the command sudo reboot in my Linux terminal to reboot the server, right?

Correct. The good news is that we can therefore eliminate the use of application clocks as causing the reported observations. It appears the --reset-gpu-clocks command went through, meaning all clocks should be reset to their default values now.

I actually have not done that in a while and cannot remember the details. I think sudo reboot is just a soft reboot, i.e. at no time is the power to the system turned off (which is the goal here). For power cycling you would want to ensure physical access to the system, so you can definitely turn it back on after shutting it off. I think sudo shutdown or sudo systemctl poweroff should do it; definitely ask someone more knowledgeable for help.

Should I sudo reboot the server after resetting the GPU clocks?

The purpose of power cycling is to try and reset any hardware mechanism that has gotten stuck in a “bad” state and cannot be reset through software means, such as issuing a --reset-gpu-clocks via nvidia-smi.

If your system appears to act normally after use of --reset-gpu-clocks, there may be no need to power cycle the system.

Please try driver version 560.28.03. Some other beta drivers can do P0 too.

I’m running my 2080 Ti, 3080 Ti and TITAN V at constant P0 under compute load, graphics load and idle. I can overclock without worrying about the clocks jumping up when a compute application finishes.

I used to have the same problem: under load at overclocked P2 the GPU was fine, but when the program finished the GPU jumped to P0, the overclock went too high, and the GPU crashed.

Here are some settings:

nvidia-smi -pm 1

nvidia-settings -a GPUFanControlState=1 -a GPUTargetFanSpeed=99

#set power limits
#titan V
nvidia-smi -i 0 -pl 200

#2x2080Ti
nvidia-smi -i 1 -pl 221
nvidia-smi -i 2 -pl 222

#3080Ti
nvidia-smi -i 3 -pl 330

#set thermal limits
nvidia-smi -i 0 -gtt 82
nvidia-smi -i 1 -gtt 82
nvidia-smi -i 2 -gtt 82
nvidia-smi -i 3 -gtt 76

#allow P0
nvidia-smi --cuda-clocks=OVERRIDE
nvidia-settings -a GPUPowerMizerMode=1

Thank you.

I have several problems regarding your descriptions.

  1. What do you mean by “GPU crashed”? Does it mean the GPU stopped working, or that it ran slowly?
    My weird problem is that my code runs slower than it did on the same computer half a year ago.

  2. Since my GPU is an RTX 3090, what specific parameters should I set? The numbers you show are for your cards.

  3. Could I set only these two: “nvidia-smi --cuda-clocks=OVERRIDE” and “nvidia-settings -a GPUPowerMizerMode=1”?

  1. Yes, the GPU stopped working.
  2+3. You set your own limits. The cards use much more energy when in P0.
    Depending on your cooling solution, environment temperature, airflow, and the program you run, you need to set your own limits for power, temperature, and overclocking offsets for GPU and RAM clocks. P0 is a totally different beast compared to P2. Prepare for days of tweaking.
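To pick values suitable for your own card, you can first query the supported ranges (a sketch; field names are listed by nvidia-smi --help-query-gpu):

```shell
# Query the valid power-limit range for this card (in watts).
nvidia-smi --query-gpu=power.min_limit,power.default_limit,power.max_limit \
           --format=csv
# List the memory/graphics clock combinations the card supports.
nvidia-smi -q -d SUPPORTED_CLOCKS
```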
