Tesla K80 overheating

bclark · July 10, 2015, 2:28pm

We are running an app on a K80 that does fine for the first 2-3 minutes, but the temperature of one of the GPUs climbs steadily, reaching 90C after about 3 minutes, and the clock speeds then throttle to between a third and an eighth of what they were. The card has only a passive heat sink. Has anyone else overcome this hurdle?
TIA.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 340.32     Driver Version: 340.32         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   91C    P0   110W / 149W |    940MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   63C    P0   120W / 149W |    940MiB / 11519MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+

bclark · July 10, 2015, 2:38pm

This is how it looks in normal operation, before throttling induces overruns:

+------------------------------------------------------+                       
| NVIDIA-SMI 340.32     Driver Version: 340.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   75C    P0   118W / 149W |    793MiB / 11519MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   55C    P0   128W / 149W |    793MiB / 11519MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
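The two snapshots above were taken by hand. To see how temperature and SM clock move over time as the throttling sets in, one option is to poll nvidia-smi's CSV query mode. The following is only a minimal sketch: the helper names (parse_samples, hot_gpus, query_once, poll), the 85C threshold, and the 5 second interval are assumptions made for illustration, and the set of query fields a given driver supports should be checked with `nvidia-smi --help-query-gpu`.

```python
import csv
import io
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`. Availability depends on the
# driver version; check `nvidia-smi --help-query-gpu` on the target machine.
FIELDS = ["index", "temperature.gpu", "clocks.sm", "power.draw", "utilization.gpu"]


def parse_samples(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (v.strip() for v in rec))))
    return rows


def hot_gpus(samples, limit_c=85):
    """Return indices of GPUs at or above limit_c (threshold is an assumption)."""
    hot = []
    for s in samples:
        try:
            if int(s["temperature.gpu"]) >= limit_c:
                hot.append(int(s["index"]))
        except ValueError:
            pass  # the field was reported as N/A or similar
    return hot


def query_once():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_samples(subprocess.check_output(cmd, text=True))


def poll(interval_s=5, count=60):
    """Print one line per GPU every interval_s seconds, count times."""
    for _ in range(count):
        samples = query_once()
        for s in samples:
            print(time.strftime("%H:%M:%S"), s)
        hot = hot_gpus(samples)
        if hot:
            print("GPUs at or above threshold:", hot)
        time.sleep(interval_s)
```

Calling poll() while the application runs gives a log in which a falling clocks.sm value can be lined up against the rising temperature. It only observes the problem; the cooling itself still has to come from the chassis airflow.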

droettger · July 10, 2015, 3:04pm

Are you running it inside a certified server configuration?

Please read this thread and the links in it:
https://devtalk.nvidia.com/default/topic/830470/?comment=4522937

bclark · July 10, 2015, 4:15pm

We have an iHawk GPU Workbench CUDA server built (including the K80) by Concurrent Computer Corp, and we are doing only CUDA, not graphics, on the K80.

One of the links mentions:

So “certified” by who?

droettger · July 11, 2015, 1:57pm

Maybe I picked the wrong word. As you saw, the first question on this developer forum when K80 server boards are involved is always whether the server system was built to support the passive cooling solution, the required monitoring, BIOS, etc.

If you bought a full server system configuration from one vendor and the machine is not behaving as expected, then you should contact the system vendor first to determine whether a defect is involved.

jeremyrutman · February 12, 2017, 5:02pm

I have an ‘Uncorr. ECC’ problem on my K80 that is preventing its use:

root@x:~# nvidia-smi
Sun Feb 12 11:00:53 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   66C    P0   105W / 149W |  10819MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   145W / 149W |  10819MiB / 11439MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   43C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   54C    P0   151W / 149W |  10819MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14252    C   python solve_jr.py 0                         10815MiB |
|    1      9625    C   python solve_jr.py 1                         10815MiB |
|    3     15555    C   python solve_jr.py 3                         10815MiB |
+-----------------------------------------------------------------------------+

Apparently some sort of error has occurred on GPU #2. Should I turn off ECC, reboot the machine, or what?
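For anyone who wants to watch this counter from a script rather than read it off the table, the same value can be pulled through nvidia-smi's CSV query mode. This is a minimal sketch only: the helper names (parse_ecc, gpus_with_errors, query_ecc) are made up for illustration, and it assumes the driver exposes the ecc.errors.uncorrected.volatile.total query field (check `nvidia-smi --help-query-gpu`). It reports the count and does not decide whether disabling ECC or rebooting is the right response.

```python
import subprocess

# Query fields for `nvidia-smi --query-gpu`; the ECC field is assumed to be
# supported by the installed driver and reported for this board.
QUERY = "index,ecc.errors.uncorrected.volatile.total"


def parse_ecc(csv_text):
    """Map GPU index -> volatile uncorrected ECC count (None if not reported)."""
    counts = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        # Non-numeric values (for example "[N/A]") are stored as None.
        counts[int(parts[0])] = int(parts[1]) if parts[1].isdigit() else None
    return counts


def gpus_with_errors(counts):
    """Return the sorted indices of GPUs reporting a nonzero count."""
    return sorted(i for i, n in counts.items() if n)


def query_ecc():
    """Run nvidia-smi once and return the per-GPU counts."""
    cmd = ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"]
    return parse_ecc(subprocess.check_output(cmd, text=True))
```

On the machine above, gpus_with_errors(query_ecc()) would be expected to single out GPU 2. The volatile counters cover errors since the driver was last loaded, so a count that comes back after a reboot points to a recurring fault rather than a one-off event.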
