GPU Temperature: Quadro RTX 8000

I am writing to seek clarification regarding the Quadro RTX 8000 GPUs. I currently own four Quadro RTX 8000 GPUs, which I use for neural network (NN) training with TensorFlow.

Versions:
Driver: 535
CUDA: 12.2
TensorFlow: 2.15

During NN training and optimization, I have observed that the temperature of the GPUs tends to stabilize around 82-83 degrees Celsius. These training sessions often run for many hours, or even 2-3 days, so I am concerned about the long-term effects that sustained operation at this temperature may have on the GPUs.

Could you please advise if maintaining a temperature within this range for extended periods can damage the GPUs? I would appreciate any recommendations or strategies to optimize the GPU temperature to ensure longevity.

Continuous full computational load leading to GPU temperatures around 80 °C is not unusual and not alarming per se. The temperatures typically reached depend on the GPU and the specifics of the cooling solution. I am not familiar with the Quadro RTX 8000, but since it is a high-end GPU I would expect it to run hot.

GPUs have built-in thermal protection, with various thresholds to that effect. For example, on one of my GPUs, the GPU Target Temperature is set to 83 °C, the GPU Slowdown Temp to 101 °C, and the GPU Shutdown Temp to 104 °C. Running nvidia-smi -q shows these limits. If a limit is crossed, the GPU will first be slowed down and, in the worst case, halted.
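For reference, here is a minimal sketch of how one might pull these thresholds out of nvidia-smi -q output programmatically. The sample text below is illustrative, not real output; run nvidia-smi -q -d TEMPERATURE on your own system and substitute the actual text (field names can vary slightly between driver versions).

```python
# Illustrative sketch: parse the temperature limits from `nvidia-smi -q` output.
# The sample_output string is made up for demonstration; feed in real output
# captured on your own system.
import re

sample_output = """\
    Temperature
        GPU Current Temp                  : 82 C
        GPU Shutdown Temp                 : 104 C
        GPU Slowdown Temp                 : 101 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : 83 C
"""

def parse_temps(text):
    """Return a dict mapping each reported temperature field to degrees C (None for N/A)."""
    temps = {}
    for label, value in re.findall(r"(GPU [\w ]+?)\s*:\s*(\d+ C|N/A)", text):
        temps[label.strip()] = None if value == "N/A" else int(value.split()[0])
    return temps

limits = parse_temps(sample_output)
print(limits["GPU Slowdown Temp"])   # 101
```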

Generally speaking, all semiconductor devices age physically, and this ageing process is accelerated at higher operating temperatures. Semiconductor devices are therefore designed with engineering margins and temperature limits intended to ensure that the device remains fully operable over the intended lifetime and duty cycle. This is a statistical computation, i.e. a question of probabilities, not certainty. I don't know what targets are used for the design of Quadro GPUs, but one might assume a duty cycle of 100% (since it is a professional-grade GPU) and a target lifetime of 5+ years. In the case of GPUs, the temperature limits are monitored and enforced.
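To give a feel for the magnitude of the effect, here is a textbook Arrhenius acceleration-factor estimate. This is emphatically not NVIDIA's actual reliability model; the activation energy used below is an assumed ballpark figure, and real failure mechanisms each have their own value.

```python
# Illustrative only: a generic Arrhenius acceleration-factor estimate, NOT
# NVIDIA's actual reliability model. EA is an assumed textbook ballpark value.
import math

K_B = 8.617e-5   # Boltzmann constant in eV/K
EA = 0.7         # assumed activation energy in eV

def acceleration_factor(t_cool_c, t_hot_c, ea=EA):
    """Relative speed-up of thermally activated ageing at t_hot_c vs t_cool_c."""
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp((ea / K_B) * (1.0 / t_cool - 1.0 / t_hot))

# Under these assumptions, running at 83 °C instead of 70 °C more than doubles
# the rate of thermally activated ageing.
print(round(acceleration_factor(70.0, 83.0), 2))
```

The point is qualitative: a temperature difference of only a dozen degrees can change ageing rates by integer factors, which is why the engineering margins mentioned above matter.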

From observation, hitting the lowest thermal limit on a GPU will reduce or eliminate clock boosting, i.e. the operating frequency of the GPU is limited to reduce power draw and thus thermal load. A reduction in operating frequency is accompanied by a reduction in computational performance. Look for Thermal Slowdown in the output of nvidia-smi -q to find out whether this is the case.

If your system is affected, check the Fan Speed reported by nvidia-smi. It should obviously be greater than 0%, but depending on ambient temperature it may not reach 100% even under full load. Also check the airflow around the GPU in the case (enclosure): is it obstructed by cabling or other PCIe cards? Over time, fans and heat-sink fins often become coated in a layer of dust which impedes heat removal. This can be cleaned away with a can of compressed air.
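The two checks above can be combined into a small diagnostic. Again the sample text is made up for demonstration (and the section heading differs between driver versions, e.g. Clocks Throttle Reasons vs. Clocks Event Reasons); capture real output from nvidia-smi -q on your own system.

```python
# Hypothetical sketch: given a fragment of `nvidia-smi -q` output, report
# whether the GPU is thermally throttled and what the fan is doing.
# The sample text is illustrative, not real output.
sample = """\
    Fan Speed                             : 74 %
    Clocks Throttle Reasons
        SW Power Cap                      : Not Active
        HW Thermal Slowdown               : Not Active
        SW Thermal Slowdown               : Active
"""

def diagnose(text):
    # Build a field -> value map from every "key : value" line.
    fields = dict(
        tuple(part.strip() for part in line.split(":", 1))
        for line in text.splitlines() if ":" in line
    )
    thermal = any(
        fields.get(key) == "Active"
        for key in ("HW Thermal Slowdown", "SW Thermal Slowdown")
    )
    fan = int(fields["Fan Speed"].rstrip(" %"))
    return {"thermal_slowdown": thermal, "fan_percent": fan}

print(diagnose(sample))   # {'thermal_slowdown': True, 'fan_percent': 74}
```

In this (fabricated) example the GPU is thermally throttled while the fan sits at 74%, which would suggest looking at airflow or dust before anything else.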

Since GPU temperature is a function of power dissipation, you could also experiment with the power limit of the GPU, for example by setting the Requested Power Limit lower than the Max Power Limit displayed by nvidia-smi. There is a non-linear relationship between power draw and clock frequency, so reducing the power limit by, say, 20% may have only a small negative impact on overall performance. You would have to determine experimentally what works best for your use case.
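A back-of-the-envelope sketch of why the trade-off is favourable, under the common simplifying assumptions that dynamic power scales roughly with f·V² and that voltage scales roughly linearly with frequency, so P ~ f³. This is a crude model, not a measurement; actual behaviour varies by GPU and workload, which is why experimentation is the only reliable guide.

```python
# Crude model, assuming dynamic power P ~ f^3 (from P ~ f*V^2 with V ~ f).
# Real GPUs deviate from this; treat the number as a rough intuition only.

def clock_fraction_for_power_cap(power_fraction):
    """Relative clock (and, roughly, throughput) sustainable under P ~ f^3."""
    return power_fraction ** (1.0 / 3.0)

# Capping power at 80% of maximum:
clock = clock_fraction_for_power_cap(0.80)
print(f"~{(1 - clock) * 100:.0f}% clock reduction for a 20% power reduction")
```

Under this toy model a 20% power cut costs only about 7% of clock speed, which matches the observation that modest power caps often barely dent throughput.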

Thank you! That was very helpful. I found out that my GPUs' target temperature is 84 °C. Do you think there is any advantage in decreasing the target temperature a little? I do deep learning with 3D images, which can take a long time to optimize.

To my knowledge, GPU target temperature is not a user-settable limit. Other than improving cooling, e.g. by lowering the ambient temperature, you can experiment with the GPU power limit, which is settable via nvidia-smi, to indirectly influence the temperature. Setting the power limit likely requires administrator privileges.

[Later:] I belatedly noticed that the capability for users to set the GPU target temperature via nvidia-smi was added some years ago, with the -gtt command-line switch. As I was not aware of that switch until just now, I have no experience with its use.

Running a GPU at low temperatures may offer performance advantages by allowing a higher clock boost. The nominal operating frequency for a GPU may be 1500 MHz and the maximum possible boost clock 1850 MHz, but by observation that maximum boost is generally unachievable if the GPU temperature is above 60 °C. Reducing the GPU operating temperature slightly is therefore likely to yield only an insignificant improvement in clock boosting.

As I stated earlier, the ageing processes in semiconductor devices accelerate with higher temperatures and slow down with lower temperatures, so people who intend to keep their semiconductor devices functional for longer than usual, say 10 years, might be interested in running their equipment cool, increasing the likelihood of survival. In my personal experience (= anecdotal evidence), power supplies, rotational storage, and system DRAM are usually the first components to fail in an aging system.
