Tesla V100 GPU thermal shutdown even when it's doing nothing

Hello,

We are facing a GPU thermal problem and need some hardware support or advice/explanation/suggestion/anything.
The Tesla V100 overheats and shuts down even though it's doing nothing… (BTW, we have two Tesla V100s and tested both.)
When we power on our workstation (server), the Tesla V100 instantly climbs above 90 °C, for no apparent reason!
We also tried one power cable with a splitter and two separate power cables for the Tesla V100 (same problem).
With a GeForce RTX 2070, however, there is no such problem.
We want to solve this immediately, since we use this workstation for deep learning.
Can you suggest a way to solve this problem?

Our workstation specs:

  • Mainboard: ASUS C621E SAGE
  • Power supply: Seasonic PRIME 1300 Platinum - SSR-1300PD Full Platinum
  • CPU: 2x Intel Xeon Scalable Gold 6230 Processor
  • FAN: 2x NOCTUA NH-U12S DX-3647
  • RAM: 8x Samsung DDR4 32GB PC4-21300
  • SSD: Samsung 970 EVO Plus series 2TB M.2
  • Case: 3RSYS T1000
  • OS: Ubuntu 18.04
  • Driver Version: 455.45.01
  • Others: 24pin + 8 pin power cables, USB mouse, keyboard, monitor

Thank you.

The Tesla doesn’t have active cooling, it depends on the chassis to provide fans for the necessary airflow. If you’re using it in a workstation chassis, you’ll have to get some add-on fans for it.


Yeah, but what I don't understand is why the GPU gets hotter while doing nothing…?
Shouldn't it at least stay at a normal/room temperature?
There are no processes running on the GPU (not even a GUI).

Please post the output of nvidia-smi to check idle power consumption.

Also make sure that you enabled nvidia-persistenced to start on boot and that it continuously runs.
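For reference, a minimal way to do that on a systemd-based distribution like Ubuntu, assuming the driver package installed the usual `nvidia-persistenced` service unit:

```shell
# Enable the NVIDIA persistence daemon now and at every boot
sudo systemctl enable --now nvidia-persistenced

# Confirm it is active
systemctl is-active nvidia-persistenced
```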

Yeah, we enabled nvidia-persistenced mode as you mentioned.

But the GPU isn't doing anything and keeps getting hotter…

$ nvidia-smi:

Thu Dec 17 11:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   57C    P0    32W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Idle power consumption:

Pwr: Usage/Cap
32W / 250W

Is it right?

It seems the GPU's power usage also keeps increasing after a few minutes.
After 15 minutes:
57C P0 32W / 250W -> 82C P0 48W / 250W
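To capture that climb over time, nvidia-smi can log temperature, power draw, and performance state at a fixed interval (the 30-second interval here is just an example):

```shell
# Sample GPU temperature, power draw, and P-state every 30 seconds (Ctrl-C to stop)
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,pstate \
           --format=csv -l 30
```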

$ nvidia-smi:

Thu Dec 17 12:01:11 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   82C    P0    48W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

There's definitely something wrong; the GPUs are running at P0 constantly. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
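For reference, the script ships with the driver and writes its archive into the current working directory:

```shell
# Run as root; produces nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh
ls -lh nvidia-bug-report.log.gz
```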

BTW, currently we are testing on only one Tesla V100 GPU (not two).

nvidia-bug-report.log.gz (421.3 KB)
nvidia-bug-report.log (1.1 MB)

Not really anything visible, but I have a suspicion:
You previously had a 2080 with a monitor connected and the system configured to run an Xserver on it. The nvidia config is still there, so if you didn't disable the Xserver/Desktop from starting, it will
  • start on the Tesla,
  • find no monitor,
  • stop,
in a fast loop, hammering the Tesla. Please check whether that is the case, e.g. by reconfiguring X to use the onboard ASPEED graphics.
Furthermore, it doesn't seem you're using the persistence daemon (nvidia-persistenced) but rather the deprecated persistence mode. It shouldn't make a difference, but better to stay on the recommended setup.
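A minimal sketch of such a reconfiguration, assuming the onboard ASPEED controller uses the `ast` X driver (the BusID below is hypothetical; find yours with `lspci | grep -i aspeed`):

```
# /etc/X11/xorg.conf -- force X onto the onboard ASPEED graphics
Section "Device"
    Identifier "OnboardASPEED"
    Driver     "ast"
    BusID      "PCI:4:0:0"   # hypothetical; replace with your lspci value
EndSection

Section "Screen"
    Identifier "DefaultScreen"
    Device     "OnboardASPEED"
EndSection
```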

Okay, somehow that makes sense. We will check that one.

Per your suggestion, we'll prefer the persistence daemon (nvidia-persistenced) - got it :)

Thank you for your responses.