How to prevent a GPU from overheating

hyungmin.chang · February 20, 2019, 2:13am

I use a GPU for deep learning simulation.
The GPU is an NVIDIA Tesla K20m, which has a large heatsink but no cooling fan.

When I run deep learning simulations in Python for more than 5 minutes,
the GPU temperature goes above 94 degrees Celsius (201 degrees Fahrenheit).
I installed the GPU in a normal desktop PC with 6 system fans at my office.
Sometimes I see a message like “GPU driver can not be found” while running ‘watch -n 10 nvidia-smi’.
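
For reference, the ‘watch -n 10 nvidia-smi’ check can also be scripted so that a run stops when the card gets too hot. This is only a minimal sketch: the 90 degree limit and the function names are made up for illustration, and it assumes nvidia-smi is on the PATH and reports a numeric temperature for this board.

```python
import subprocess
import time


def parse_temperatures(csv_text):
    """Return one integer temperature (deg C) per numeric line of nvidia-smi output."""
    temps = []
    for line in csv_text.splitlines():
        line = line.strip()
        # Skip blank lines and non-numeric values such as "N/A".
        if line.isdigit():
            temps.append(int(line))
    return temps


def read_gpu_temperatures():
    """Query nvidia-smi once; raises if the tool is missing or the driver fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def too_hot(temps, limit_c=90):
    """True if any GPU is at or above limit_c (hypothetical threshold, not an NVIDIA spec)."""
    return any(t >= limit_c for t in temps)


def monitor(interval_s=10, limit_c=90):
    """Poll every interval_s seconds and return the readings once the limit is reached."""
    while True:
        temps = read_gpu_temperatures()
        print("GPU temperatures (C):", temps)
        if too_hot(temps, limit_c):
            return temps
        time.sleep(interval_s)
```

Calling monitor() from a separate process and stopping the training job when it returns would only pause the work. It does not replace proper airflow.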

Is there a way to switch the GPU from something like a ‘full performance mode’ to an ‘overheat prevention mode’?

P.S. I am going to try adding more cooling fans around the GPU.
One option is to mount cooling fans under the GPU, in the PCI slot area.
The other option is to buy a blower fan and a shroud duct for the GPU.

njuffa · February 20, 2019, 3:00am

The Tesla K20m is a passively-cooled GPU designed for installation in a server enclosure that provides adequate air flow, typically provided by banks of high-RPM fans and appropriate ducting. Integrators know how to provide sufficient airflow in such an enclosure and they sell servers with the GPU pre-installed. NVIDIA maintains a list of partners that sell such systems: https://www.nvidia.com/en-us/data-center/where-to-buy-tesla/

The Tesla K20m is not a product targeted at consumers and support for it is designed to be provided by the integrators that sell systems with this GPU. For workstations, there is the actively-cooled K20c that has an integrated fan and a matching shroud.

If you search the internet, you should be able to find people showing off their projects building ducts and funnels from sheet metal or other materials that they attach to the K20m and other passively-cooled GPUs, and then use large fans to direct massive airflow over the heatsinks of those GPUs. In at least some cases, that seems to work. Here is an example using a Tesla K80 (this is not an endorsement on my part, proceed at your own risk):

How to cool the nVidia Tesla K80 Cheaply - YouTube

I assume your current approach using lots of fans is mostly just agitating the air and providing an insufficient and probably very turbulent air flow. What you need is a mostly laminar flow along the fins of the heatsink(s), moving an adequate volume of air per unit of time. My memory is hazy, but about 50 cfm should do it.
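
As a rough sanity check on that figure, the airflow needed to carry away a given heat load follows from the heat capacity of air. The sketch below assumes a 225 W board power for the K20m and a 10 degree Celsius rise in air temperature across the heatsink. Both numbers are assumptions for illustration, not measured values, and the function name is made up.

```python
AIR_DENSITY_KG_M3 = 1.2   # approximate, near sea level at about 20 C
AIR_CP_J_KG_K = 1005.0    # specific heat of air at constant pressure
M3_PER_S_TO_CFM = 2118.88  # 1 cubic metre per second in cubic feet per minute


def required_airflow_cfm(power_w, delta_t_c):
    """Airflow (cfm) needed to remove power_w watts with a delta_t_c rise in air temperature."""
    if delta_t_c <= 0:
        raise ValueError("delta_t_c must be positive")
    # Heat balance: P = rho * cp * Q * dT, solved for the volume flow Q.
    flow_m3_s = power_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_KG_K * delta_t_c)
    return flow_m3_s * M3_PER_S_TO_CFM


# Assumed values: 225 W heat load, 10 C air temperature rise.
print(round(required_airflow_cfm(225, 10), 1))
```

With these assumptions the result is roughly 40 cfm, the same order as the 50 cfm quoted above. Only air that actually passes between the fins counts, which is why a duct or shroud matters more than the number of case fans.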
