GPU Temperature: Quadro RTX 8000

I am writing to seek clarification regarding the Quadro RTX 8000 GPUs. I currently own four Quadro RTX 8000 GPUs, which I use for neural network (NN) training with TensorFlow.

Versions:
Driver: 535
CUDA: 12.2
TensorFlow: 2.15

During NN training and optimization, I have observed that the temperature of the GPUs tends to stabilize around 82-83 degrees Celsius. These training sessions often run for many hours, or even 2-3 days, so I am concerned about the long-term effects that sustained operation at this temperature may have on the GPUs.

Could you please advise if maintaining a temperature within this range for extended periods can damage the GPUs? I would appreciate any recommendations or strategies to optimize the GPU temperature to ensure longevity.

Continuous full computational load leading to GPU temperatures around 80 °C is not unusual and not alarming per se. The temperatures typically reached depend on the GPU and the specifics of the cooling solution. I am not familiar with the Quadro RTX 8000, but since it is a high-end GPU I would expect it to run hot.

GPUs have built-in thermal protection, with various thresholds to that effect. For example, on one of my GPUs, the GPU Target Temperature is set to 83 °C, the GPU Slowdown Temp to 101 °C, and the GPU Shutdown Temp to 104 °C. Running nvidia-smi -q shows these limits. If a limit is crossed, the GPU will first be slowed down and, in the worst case, halted.
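For reference, here is a minimal sketch of how one might pull these thresholds out of nvidia-smi -q output programmatically. The sample text below is illustrative, not real output; run nvidia-smi -q -d TEMPERATURE on your own system and substitute the actual text (field names can vary slightly between driver versions).

```python
# Illustrative sketch: parse the temperature limits from `nvidia-smi -q` output.
# The sample_output string is made up for demonstration; feed in real output
# captured on your own system.
import re

sample_output = """\
    Temperature
        GPU Current Temp                  : 82 C
        GPU Shutdown Temp                 : 104 C
        GPU Slowdown Temp                 : 101 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : 83 C
"""

def parse_temps(text):
    """Return a dict mapping each reported temperature field to degrees C (None for N/A)."""
    temps = {}
    for label, value in re.findall(r"(GPU [\w ]+?)\s*:\s*(\d+ C|N/A)", text):
        temps[label.strip()] = None if value == "N/A" else int(value.split()[0])
    return temps

limits = parse_temps(sample_output)
print(limits["GPU Slowdown Temp"])   # 101
```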

Generally speaking, all semiconductor devices age physically, and this ageing process is accelerated at higher operating temperatures. Semiconductor devices are therefore designed with engineering margins and temperature limits intended to ensure that the device remains fully operable over the intended lifetime and duty cycle. This is a statistical computation, i.e. a question of probabilities, not certainty. I don't know what targets are used for the design of Quadro GPUs, but one might assume a duty cycle of 100% (since it is a professional-grade GPU) and a target lifetime of 5+ years. In the case of GPUs, the temperature limits are monitored and enforced.
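To give a feel for the magnitude of the effect, here is a textbook Arrhenius acceleration-factor estimate. This is emphatically not NVIDIA's actual reliability model; the activation energy used below is an assumed ballpark figure, and real failure mechanisms each have their own value.

```python
# Illustrative only: a generic Arrhenius acceleration-factor estimate, NOT
# NVIDIA's actual reliability model. EA is an assumed textbook ballpark value.
import math

K_B = 8.617e-5   # Boltzmann constant in eV/K
EA = 0.7         # assumed activation energy in eV

def acceleration_factor(t_cool_c, t_hot_c, ea=EA):
    """Relative speed-up of thermally activated ageing at t_hot_c vs t_cool_c."""
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp((ea / K_B) * (1.0 / t_cool - 1.0 / t_hot))

# Under these assumptions, running at 83 °C instead of 70 °C more than doubles
# the rate of thermally activated ageing.
print(round(acceleration_factor(70.0, 83.0), 2))
```

The point is qualitative: a temperature difference of only a dozen degrees can change ageing rates by integer factors, which is why the engineering margins mentioned above matter.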

From observation, hitting the lowest thermal limit on a GPU will reduce or eliminate clock boosting, i.e. the operating frequency of the GPU is limited to reduce power draw and thus thermal load. A reduction in operating frequency is accompanied by a reduction in computational performance. Look for Thermal Slowdown in the output of nvidia-smi -q to find out whether this is the case.

If your system is affected, check the Fan Speed reported by nvidia-smi. It should obviously be greater than 0%, but depending on ambient temperature it may not reach 100% even under full load. Also check the airflow around the GPU in the case (enclosure): is it obstructed by cabling or other PCIe cards? Over time, fans and heat-sink fins often become coated in a layer of dust which impedes heat removal. This can be cleaned away with a can of compressed air.
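The two checks above can be combined into a small diagnostic. Again the sample text is made up for demonstration (and the section heading differs between driver versions, e.g. Clocks Throttle Reasons vs. Clocks Event Reasons); capture real output from nvidia-smi -q on your own system.

```python
# Hypothetical sketch: given a fragment of `nvidia-smi -q` output, report
# whether the GPU is thermally throttled and what the fan is doing.
# The sample text is illustrative, not real output.
sample = """\
    Fan Speed                             : 74 %
    Clocks Throttle Reasons
        SW Power Cap                      : Not Active
        HW Thermal Slowdown               : Not Active
        SW Thermal Slowdown               : Active
"""

def diagnose(text):
    # Build a field -> value map from every "key : value" line.
    fields = dict(
        tuple(part.strip() for part in line.split(":", 1))
        for line in text.splitlines() if ":" in line
    )
    thermal = any(
        fields.get(key) == "Active"
        for key in ("HW Thermal Slowdown", "SW Thermal Slowdown")
    )
    fan = int(fields["Fan Speed"].rstrip(" %"))
    return {"thermal_slowdown": thermal, "fan_percent": fan}

print(diagnose(sample))   # {'thermal_slowdown': True, 'fan_percent': 74}
```

In this (fabricated) example the GPU is thermally throttled while the fan sits at 74%, which would suggest looking at airflow or dust before anything else.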

Since GPU temperature is a function of power dissipation, you could also experiment with the power limit of the GPU, for example by setting the Requested Power Limit lower than the Max Power Limit displayed by nvidia-smi. There is a non-linear relationship between power draw and clock frequency, so reducing the power limit by, say, 20% may have only a small negative impact on overall performance. You would have to determine experimentally what works best for your use case.
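A back-of-the-envelope sketch of why the trade-off is favourable, under the common simplifying assumptions that dynamic power scales roughly with f·V² and that voltage scales roughly linearly with frequency, so P ~ f³. This is a crude model, not a measurement; actual behaviour varies by GPU and workload, which is why experimentation is the only reliable guide.

```python
# Crude model, assuming dynamic power P ~ f^3 (from P ~ f*V^2 with V ~ f).
# Real GPUs deviate from this; treat the number as a rough intuition only.

def clock_fraction_for_power_cap(power_fraction):
    """Relative clock (and, roughly, throughput) sustainable under P ~ f^3."""
    return power_fraction ** (1.0 / 3.0)

# Capping power at 80% of maximum:
clock = clock_fraction_for_power_cap(0.80)
print(f"~{(1 - clock) * 100:.0f}% clock reduction for a 20% power reduction")
```

Under this toy model a 20% power cut costs only about 7% of clock speed, which matches the observation that modest power caps often barely dent throughput.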

Thank you! That was very helpful. I found out that my GPUs' target temperature is 84 °C. Do you think there is any advantage in decreasing the target temperature a little? I do deep learning with 3D images, which can take a long time to optimize.

To my knowledge, GPU target temperature is not a user-settable limit. Other than improving cooling, e.g. by lowering the ambient temperature, you can experiment with the GPU power limit, which is settable via nvidia-smi, to indirectly influence the temperature. Setting the power limit likely requires administrator privileges.

[Later:] I belatedly noticed that the capability for users to set the GPU target temperature via nvidia-smi was added some years ago, with the -gtt command-line switch. As I was not aware of that switch until just now, I have no experience with its use.

Running a GPU at low temperatures may offer performance advantages by allowing a higher clock boost. The nominal operating frequency for a GPU may be 1500 MHz and the maximum possible boost clock 1850 MHz, but by observation that maximum boost is generally unachievable if the GPU temperature is above 60 °C. Reducing the GPU operating temperature slightly is therefore likely to yield only an insignificant improvement in clock boosting.

As I stated earlier, the ageing processes in semiconductor devices accelerate with higher temperatures and slow down with lower temperatures, so people who intend to keep their semiconductor devices functional for longer than usual, say 10 years, might be interested in running their equipment cool, increasing the likelihood of survival. In my personal experience (= anecdotal evidence), power supplies, rotational storage, and system DRAM are usually the first components to fail in an aging system.
