Understanding Optimal GPU Temperature and Default GPU Fan Curve (NVIDIA RTX 6000 Ada)

Hello,

We are using NVIDIA RTX 6000 Ada GPUs in production environments (uptime with load is 24/7).

Recently we have been facing problems with the GPUs getting too hot, throttling at 90°C and hence preventing our system from working correctly. While investigating possible solutions for better air ventilation, we also tried to understand the options we have at hand concerning the GPU itself. We use driver version 535.274.02, as the production environment is still based on DeepStream 7.0.

Using nvidia-smi, we monitored the temperature of a first GPU deployed in a production environment with improved air ventilation. We see that at stable temperatures around 85°C, the GPU Fan settles at just 60% speed.

When using nvidia-smi to capture the GPU fan curve of a second GPU deployed in a production environment with poor air ventilation, we see the fan speed only slowly rising after almost reaching the critical slowdown temperature of 90°C (the graph shows temperature-speed pairs obtained every second).
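For reference, such temperature/fan pairs can be captured once per second with `nvidia-smi --query-gpu=temperature.gpu,fan.speed --format=csv,noheader -l 1` and parsed along these lines (a minimal sketch of our logging approach; the helper name is ours):

```python
# Hedged sketch: each line emitted by
#   nvidia-smi --query-gpu=temperature.gpu,fan.speed --format=csv,noheader -l 1
# looks like "88, 60 %"; we turn it into an integer pair.

def parse_sample(line: str) -> tuple[int, int]:
    """Parse one CSV line of the query output into (temp_c, fan_percent)."""
    temp_s, fan_s = line.split(",")
    return int(temp_s.strip()), int(fan_s.strip().rstrip(" %"))
```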

During these analyses, the following questions came up.

a) Is it correct that the default configuration of said driver / the VBIOS of the GPU regulates the GPU fan to only 60% speed at 85°C? Would this be the optimal target temperature to increase the GPU's life expectancy?

b) Is there a way to make the fan speed react more promptly to temperature changes to avoid continuously going in and out of the throttling state due to the temperature creeping around at 90°C?

c) When opening nvidia-settings in a production environment running said GPU and driver version, the fan information does not appear (unsupported). Is there a way to fix this? Unfortunately, right now we do not have the option to install newer drivers.

d) Also, as seen above, nvidia-settings shows a slowdown threshold of 100°C, while we observe the throttling happening already at 90°C. Is there a reason for this?

e) Even after reading the manual of nvidia-settings it is not entirely clear to us how to interpret the following output of nvidia-smi.

Temperature
        GPU Current Temp                  : 88 C
        GPU T.Limit Temp                  : 4 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A

Where would we find the slowdown temperature of 90°C?
What do these values mean in the context of the current GPU temperature of 88°C?
Especially, what do the two negative values imply? They do not seem to change with changes of the current GPU temperature.

f) On a test GPU, we tried to lower the GPU target temperature from 85°C to 65°C to understand if the fan speed increases accordingly, however the speed did not change. Does this change need a reboot of the system to be applied?

Thank you very much for your time!

Hi @lukas.kofler,

GPU temps are quite a complex topic and have been discussed here before, e.g. What is the optimal Temperature for Nvidia RTX A6000 - #10 by Frank_Quadro

nvidia-smi is constantly being adjusted to accommodate all our GPUs, and temp measurements have changed over the years to include many different on-board sensors. That means some of the temp values shown by nvidia-smi are based on multiple sensors.

I guess you know the “man page” of nvidia-smi? https://docs.nvidia.com/deploy/nvidia-smi/index.html

It details the meaning of the values. As a summary:

  • T.Limit is the default remaining difference to the slowdown limit
  • Slowdown T.Limit is the value by which the overall sensor data suggests you are already “closer” to the slowdown

In your case the default slowdown would be (88 + 4), but the sensor data shows you are already 2 degrees past that, so it would be at a measured 90 degrees.

The 60% fan is likely the optimal factory setting for the fan curve to keep the GPU at 85 degrees max temp.
On most reference design GPUs you will not be able to control the fan curve with normal OS tools I am afraid. NVIDIA offers access to the NVML API that allows manual control of the fans, see NVML API Reference Guide :: GPU Deployment and Management Documentation
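A minimal sketch of what manual control via NVML could look like with the `nvidia-ml-py` Python bindings (assuming they are installed via `pip install nvidia-ml-py`; requires root and, of course, an actual NVIDIA GPU — treat this as an illustration, not a supported configuration):

```python
def set_gpu_fan_speed(gpu_index: int, speed_percent: int) -> None:
    """Force a fixed fan speed on all fans of one GPU via NVML.

    Overrides the automatic fan policy; nvmlDeviceSetDefaultFanSpeed_v2
    can be used to hand control back to the driver afterwards.
    """
    import pynvml  # imported here so the sketch can be read without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        num_fans = pynvml.nvmlDeviceGetNumFans(handle)
        for fan in range(num_fans):
            # speed_percent is 0..100 of the fan's maximum rpm
            pynvml.nvmlDeviceSetFanSpeed_v2(handle, fan, speed_percent)
    finally:
        pynvml.nvmlShutdown()
```

Note that, as Frank explains below, overriding the qualified presets may affect warranty/RMA considerations.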

Hi Lukas, thanks for all these good questions…

the actively cooled workstation GPUs are aimed at (under-the-desk) workstations, and as such, NOISE = fan rpm needs to be a limiting factor.
Your (not unique) use case of likely favoring lower temps over more noise is something we currently DO NOT COVER, so we do not have a way to let you ‘just run the fans faster’ = more noise…
Your best approach for such configs is to provide better (cold) airflow TO the GPU, so it stays cooler… (for the server version of our GPUs = passively cooled, all such control sits in the central element of the server chassis cooling…).
With noise being a critical parameter for actively cooled cards, each product has its own value for the (noise-limited) max. rpm, which we show as a percentage of the fan’s vendor-specified absolute max rpm. So yes, 60% is correct (a).

(b) basically NO, assuming we are not talking like thousands of GPUs per month? Work on the chassis cooling, which will impact the GPU cooling logic…
(c) between two generations of our GPUs we changed the temp mechanisms (and readings) substantially; I would assume this is a glitch resulting from that change. Unless it still repro’s with current drivers (580/590 branch), I don’t see a chance to fix/backport this for older drivers (your 535 branch)… I’m sorry…
(d) might also be a result of the old driver, changed temp logic and readings, but I am not sure… again, we won’t add anything but security fixes to the old r535, so as a test, could you check how all this looks and works with an r580/590 driver?

(e) one intention of the new ‘logic’ was to have homogeneous (delta) temp values for all our different SKUs, and only store a GPU-specific absolute temp with the GPU. That’s why we now show t_limit as the delta you are running from the SW slowdown temp, which varies per GPU… t_avg being the average of all internal GPU temp sensors (likely shown as current temp)…

The target temp is a range we want the GPUs to stay within over a period of time; short spikes of individual sensors running higher are fine….
In your case this reads as: you run at 88°, which is 4° away from SW slowdown, which for this SKU would then be 92°. HW slowdown would be -2°, so 2° above that 92° (94°), with a HW SHUTdown kicking in at -7° (99°). We no longer show you fixed values (those would differ per GPU), but only your delta to them, and your absolute average temp at any time…
[server temp and fan control have much easier logic now, focusing only on that t_limit value, and keeping it positive…] make (some) sense?
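Following Frank's explanation above, the absolute thresholds can be reconstructed from the delta values like this (a sketch of that reading of the output, not an official formula — the function name is ours):

```python
def absolute_limits(current, t_limit, slowdown_t_limit, shutdown_t_limit):
    """Reconstruct absolute temperatures (°C) from nvidia-smi's deltas,
    per Frank's explanation of the r535 T.Limit output."""
    sw_slowdown = current + t_limit                # 88 + 4  -> 92
    hw_slowdown = sw_slowdown - slowdown_t_limit   # 92 - (-2) -> 94
    hw_shutdown = sw_slowdown - shutdown_t_limit   # 92 - (-7) -> 99
    return sw_slowdown, hw_slowdown, hw_shutdown
```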
(f) as we don’t offer any modification of temp and rpm, I wonder how you even change the target temp? (though I’m not really familiar with what tools we offer on Linux, actually…).

The way we qualify and certify the (actively cooled) products, we guarantee proper function and accept RMA for failures (given a fan-inlet air temp that the chassis vendor needs to guarantee to the GPU, for any situation and thermal profile the OEM certifies its chassis for…).
As such, we can’t really allow users to modify any of the qualified and hence guaranteed presets…

Lukas, can you tell (DM) me what type of appliance and use case you run, and roughly the number of GPUs…? I’m still collecting ‘needs’ and reporting them to product owners and engineering leads…

many thanks

-Frank

Quick disclaimer: In case you find a mismatch between Frank’s and my comment, refer to Frank’s :-)

Thank you for jumping in here!

Hi @MarkusHoHo , hi @Frank_Quadro ,

thank you both for taking the time to provide such detailed answers!

I guess you know the “man page” of nvidia-smi?

Yes, but as stated above, the information on the manual page was not entirely clear to us. Your and Frank’s details on nvidia-smi helped us get a better understanding of its output, especially regarding the negative values. Thanks again!

NVIDIA offers access to the NVML API that allows manual control of the fans

Thanks for pointing that out, we will look into that if necessary.

(b) basically NO, assuming we are not talking like thousands of GPUs per month? Work on the chassis cooling, which will impact the GPU cooling logic…

No, we are not talking about thousands of GPUs per month. We will focus more on improving the chassis cooling then.

I would assume this is a glitch resulting from that change.

Regarding question c), this was an oversight on our side. It turns out that you need a physical monitor connected to the GPU in order to override fan speed settings using nvidia-settings. With a physical monitor connected, we were able to control the fan speed using nvidia-settings on the old driver version 535.274.02 as well. Sorry for not having tested this exhaustively before asking; this limitation of nvidia-settings was not known to us.
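In case it helps others: in our understanding, manual fan control in nvidia-settings also requires the `Coolbits` option in the X configuration (value `4` sets the fan-control bit); the exact file location below is an assumption and may differ per distribution:

```
# /etc/X11/xorg.conf.d/20-nvidia.conf  (location may vary per distribution)
Section "Device"
    Identifier "NVIDIA GPU"
    Driver     "nvidia"
    Option     "Coolbits" "4"   # bit value 4: allow manual fan control in nvidia-settings
EndSection
```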

if you could check how all this looks and works with a r580/590 driver?

We updated one production environment to version 580.126.09, but the slowdown temperature still shows as 100°C in nvidia-settings while working out to approx. 90°C in nvidia-smi. However, this is just a minor detail and not of great importance.

make (some) sense?

Yes, thanks for the detailed explanation.

I wonder how you even change target temp ??

According to the manual page of nvidia-smi, setting the target temperature is possible:

-gtt, --gpu-target-temp=MODE

Set GPU Target Temperature for a GPU in degrees celsius. Target temperature should be within limits supported by GPU. These limits can be retrieved by using query option with SUPPORTED_GPU_TARGET_TEMP. Requires Root.

While a target temperature set to 65°C is reflected when running nvidia-smi -q, the fan speed does not increase accordingly. Also, the setting is not stored persistently, i.e., after a reboot the target temperature reverted to 85°C.
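To check whether the setting took effect (and whether it survived a reboot), we parse the block printed by `nvidia-smi -q -d TEMPERATURE`; a minimal sketch based on the r535 output format shown earlier in this thread (the helper name is ours):

```python
import re

def parse_temperature_block(text: str) -> dict:
    """Extract the temperature fields from `nvidia-smi -q -d TEMPERATURE`
    output as a dict of integer degrees C; N/A fields are skipped."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(-?\d+) C\s*$", line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values
```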

can you tell (DM) me, what type of appliance and usecase you run, and roughly the number of GPUs…?

The use case per GPU is running a DeepStream inference pipeline with up to ten cameras and several different AI models. The number of GPUs does not exceed two digits.

Thanks again for your time and help, very much appreciated.
The most important points are now clearer to us.