We are facing some GPU hardware thermal problems and we need some hardware support or advice/explanation/suggestion/anything.
Our Tesla V100 overheats and triggers a thermal shutdown even when it’s doing nothing… (BTW, we have 2 Tesla V100s and tested both.)
When we power on our workstation (server), the Tesla V100’s temperature instantly rises above 90 °C (without any reason)!
We also tried powering the Tesla V100 both with 1 power cable plus a splitter and with 2 separate power cables (same problem).
But when we try a GeForce RTX 2070, there is no such problem.
We want to solve these problems immediately. We are using this workstation for Deep Learning.
So can you suggest a way to solve this problem?
Our workstation specs:
Mainboard: ASUS C621E SAGE
Power supply: Seasonic PRIME 1300 Platinum - SSR-1300PD Full Platinum
CPU: 2x Intel Xeon Scalable Gold 6230 Processor
FAN: 2x NOCTUA NH-U12S DX-3647
RAM: 8x Samsung DDR4 32GB PC4-21300
SSD: Samsung 970 EVO Plus series 2TB M.2
Case: 3RSYS T1000
OS: Ubuntu 18.04
Driver Version: 455.45.01
Others: 24pin + 8 pin power cables, USB mouse, keyboard, monitor
The Tesla doesn’t have active cooling; it depends on the chassis fans to provide the necessary airflow. If you’re using it in a workstation chassis, you’ll have to get some add-on fans for it.
Yeah, but what I don’t understand is why the GPU gets so hot even when it’s doing nothing…?
It should at least be at a normal/room temperature???
There are no processes running on the GPU (not even a GUI).
There’s definitely something wrong: the GPUs are running at P0 constantly. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
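For anyone who wants to confirm the "idle but stuck at P0" symptom themselves, here is a minimal sketch that polls nvidia-smi and flags such GPUs. The `--query-gpu` fields are standard nvidia-smi ones; the helper names and the 5% idle threshold are my own assumptions, not anything from the driver.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = ["index", "name", "temperature.gpu", "pstate", "utilization.gpu"]


def query_gpus():
    """Run nvidia-smi and return its raw CSV output (needs the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_status(csv_text):
    """Turn the CSV text into one dict per GPU, keyed by QUERY_FIELDS."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the expected layout
        rows.append(dict(zip(QUERY_FIELDS, parts)))
    return rows


def idle_but_p0(rows, max_util=5):
    """Return GPUs reporting P0 while utilization is at or below max_util percent.

    max_util is an arbitrary threshold chosen for illustration.
    """
    flagged = []
    for row in rows:
        try:
            util = int(row["utilization.gpu"])
        except ValueError:
            continue  # value not reported as a number on this GPU
        if row["pstate"] == "P0" and util <= max_util:
            flagged.append(row)
    return flagged
```

On the affected machine, `idle_but_p0(parse_gpu_status(query_gpus()))` should list both V100s if they really sit in P0 with nothing running.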
Not really anything visible but I have a suspicion:
You previously had a 2080 with a monitor connected and the system configured to run an Xserver on it. The nvidia config is still there, so if you didn’t disable the Xserver/Desktop from starting, it will start on the Tesla, find no monitor, and stop, in a loop in fast succession, hammering the Tesla. Please check if that is the case, e.g. by reconfiguring X to use the onboard ASPEED graphics.
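One rough way to check for such a restart loop is to sample the Xorg process IDs for a while and see whether new ones keep appearing. This is only a sketch: it assumes the X server process is named `Xorg` (as on a stock Ubuntu install), the function names are made up for illustration, and polling once per second can miss very short-lived processes.

```python
import subprocess
import time


def xorg_pids():
    """Return the set of PIDs of running Xorg processes (empty if none)."""
    # pgrep exits non-zero when nothing matches, so no check=True here.
    result = subprocess.run(["pgrep", "-x", "Xorg"], capture_output=True, text=True)
    return {int(pid) for pid in result.stdout.split()}


def sample_xorg(duration_s=30, interval_s=1.0):
    """Collect one PID set per interval for roughly duration_s seconds."""
    samples = []
    for _ in range(int(duration_s / interval_s)):
        samples.append(xorg_pids())
        time.sleep(interval_s)
    return samples


def count_restarts(samples):
    """Count PIDs that show up after the first sample, i.e. fresh X server starts."""
    seen = set(samples[0]) if samples else set()
    restarts = 0
    for pids in samples[1:]:
        new = set(pids) - seen
        restarts += len(new)
        seen |= new
    return restarts
```

A stable desktop (or no X server at all) should give `count_restarts(sample_xorg()) == 0`; a count that keeps growing would be consistent with the start/stop loop described above.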
Furthermore, it doesn’t seem like you’re using the persistence daemon (nvidia-persistenced) but the deprecated persistence mode. It shouldn’t make a difference, but it’s better to stay on the recommended setup.
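To see which of the two setups is in effect, something like the sketch below may help. It assumes that nvidia-smi reports `persistence_mode` as `Enabled` whether the mode was set by the daemon or by the legacy switch, so the daemon's presence is checked separately by looking for a process whose command line contains `nvidia-persistenced`. The function names and return labels are hypothetical.

```python
import subprocess


def persistence_modes():
    """Return the persistence_mode value nvidia-smi reports for each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def daemon_running():
    """True if some process command line contains 'nvidia-persistenced'."""
    result = subprocess.run(["pgrep", "-f", "nvidia-persistenced"], capture_output=True)
    return result.returncode == 0


def describe_setup(modes, daemon):
    """Classify the setup as 'daemon', 'legacy' or 'off' (labels are illustrative)."""
    if daemon:
        return "daemon"
    if any(mode == "Enabled" for mode in modes):
        return "legacy"  # persistence on, but no daemon holding it
    return "off"
```

Calling `describe_setup(persistence_modes(), daemon_running())` on the workstation would return `"legacy"` if the situation is as described above.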