Understanding Optimal GPU Temperature and Default GPU Fan Curve (NVIDIA RTX 6000 Ada)

Hello,

We are using NVIDIA RTX 6000 Ada GPUs in production environments that run under load 24/7.

Recently we have been facing problems with the GPUs getting too hot and throttling at 90°C, which prevents our system from working correctly. While investigating possible solutions for better air ventilation, we also tried to understand which options we have on the GPU itself. We use driver version 535.274.02 because the production environment is still based on DeepStream 7.0.

Using nvidia-smi, we monitored the temperature of a first GPU deployed in a production environment with improved air ventilation. We see that at a stable temperature of around 85°C, the GPU fan settles at just 60% speed.

When using nvidia-smi to capture the GPU fan curve of a second GPU deployed in a production environment with poor air ventilation, we see the fan speed rising only slowly once the temperature has almost reached the critical slowdown temperature of 90°C (the graph shows temperature-speed pairs obtained every second).
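For anyone who wants to reproduce such a trace, here is a minimal sketch of one way to collect temperature and fan-speed pairs once per second and to count how often the trace crosses the 90°C slowdown temperature. It is not necessarily the exact procedure used for the graph. The function names are hypothetical, the query fields are the ones listed by `nvidia-smi --help-query-gpu`, and the throttle count is inferred from temperature alone, which is an assumption rather than a reading of the actual throttle state.

```python
import csv
import subprocess
import sys
import time

# Query fields as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = "timestamp,temperature.gpu,fan.speed"


def parse_sample(line):
    """Parse one CSV line such as '2024/01/01 12:00:00.000, 85, 60'."""
    timestamp, temp, fan = [field.strip() for field in line.split(",")]
    return {
        "timestamp": timestamp,
        "temp_c": int(temp),
        # fan.speed can be reported as "[N/A]"; store None in that case.
        "fan_pct": int(fan) if fan.isdigit() else None,
    }


def read_sample(gpu_index=0):
    """Run nvidia-smi once and return the parsed sample for one GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def log_fan_curve(duration_s=600, interval_s=1.0, gpu_index=0, out=sys.stdout):
    """Write timestamp, temperature and fan speed as CSV, one row per interval."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "temp_c", "fan_pct"])
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        sample = read_sample(gpu_index)
        writer.writerow([sample["timestamp"], sample["temp_c"], sample["fan_pct"]])
        time.sleep(interval_s)


def count_throttle_episodes(temps_c, slowdown_c=90, resume_c=None):
    """Count how often a temperature trace enters the slowdown band.

    ASSUMPTION: throttling is inferred from temperature only. An episode
    starts when a sample reaches slowdown_c and ends when a later sample
    drops below resume_c (defaults to slowdown_c, i.e. no hysteresis).
    """
    if resume_c is None:
        resume_c = slowdown_c
    episodes = 0
    throttled = False
    for temp in temps_c:
        if not throttled and temp >= slowdown_c:
            throttled = True
            episodes += 1
        elif throttled and temp < resume_c:
            throttled = False
    return episodes
```

Calling `log_fan_curve(duration_s=3600)` on the affected machine writes the trace to standard output; the `temp_c` column can then be passed to `count_throttle_episodes` to put a number on how often the GPU moves in and out of the 90°C band.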

During these analyses, the following questions came up.

a) Is it correct that the default configuration of said driver / the VBIOS of the GPU regulates the GPU fan to only 60% speed at 85°C? Would this be the optimal target temperature to increase the GPU's life expectancy?

b) Is there a way to make the fan speed react more promptly to temperature changes to avoid continuously going in and out of the throttling state due to the temperature creeping around at 90°C?

c) When opening nvidia-settings in a production environment running said GPU and driver version, the fan information does not appear (unsupported). Is there a way to fix this? Unfortunately, right now we do not have the option to install newer drivers.

d) Also, as seen above, nvidia-settings shows a slowdown threshold of 100°C, while we observe throttling already at 90°C. Is there a reason for this?

e) Even after reading the manual of nvidia-settings, it is not entirely clear to us how to interpret the following output of nvidia-smi:

Temperature
        GPU Current Temp                  : 88 C
        GPU T.Limit Temp                  : 4 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A

Where would we find the slowdown temperature of 90°C?
What do these values mean in the context of the current GPU temperature of 88°C?
In particular, what do the two negative values imply? They do not seem to change when the current GPU temperature changes.

f) On a test GPU, we tried to lower the GPU target temperature from 85°C to 65°C to see whether the fan speed increases accordingly; however, the speed did not change. Does this change require a reboot of the system to be applied?
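To make questions e) and f) concrete, here is a minimal sketch that parses the temperature section quoted in e) (presumably the output of `nvidia-smi -q -d TEMPERATURE`) and derives two things from it. The first is the remaining headroom to each T.Limit threshold. This rests on an unconfirmed assumption: that "GPU T.Limit Temp" is the current margin in °C and that the other T.Limit fields are the fixed margin values at which each event triggers. That reading would explain why the negative values do not move with the GPU temperature. The second is whether a requested target temperature is actually reported back by the driver. All function names are hypothetical.

```python
import re

# Output copied from question e).
SAMPLE = """\
Temperature
        GPU Current Temp                  : 88 C
        GPU T.Limit Temp                  : 4 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
"""


def parse_temperature_section(text):
    """Map each 'name : value C' line to an int; N/A lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(-?\d+) C\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def headroom(values):
    """Degrees of margin left before each T.Limit threshold is reached.

    ASSUMPTION (unconfirmed): 'GPU T.Limit Temp' is the current margin and
    the remaining T.Limit fields are fixed trigger points on that same scale.
    """
    margin = values["GPU T.Limit Temp"]
    return {
        "to_max_operating": margin - values["GPU Max Operating T.Limit Temp"],
        "to_slowdown": margin - values["GPU Slowdown T.Limit Temp"],
        "to_shutdown": margin - values["GPU Shutdown T.Limit Temp"],
    }


def target_applied(values, requested_c):
    """True if the reported GPU target temperature equals the requested one."""
    return values.get("GPU Target Temperature") == requested_c
```

With the sample above, `headroom(parse_temperature_section(SAMPLE))` gives 4, 6 and 11 degrees to the max operating, slowdown and shutdown points respectively. Under the stated assumption, that would not match the throttling we observe at 90°C, which is part of what we would like to have clarified. For f), re-running the query after changing the target (for example via `nvidia-smi -gtt 65`) and passing the result to `target_applied(values, 65)` shows whether the new value was accepted at all, independently of how the fan reacts.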

Thank you very much for your time!