Tesla V100 GPU thermal shutdown even though it's doing nothing

Hello,

We are facing a GPU thermal problem and need some hardware support or advice/explanation/suggestions.
The Tesla V100 overheats and shuts down even though it's doing nothing… (BTW, we have 2 Tesla V100s and tested both.)
When we power on our workstation (server), the Tesla V100 instantly climbs above 90 °C without any load!
We also tried one power cable with a splitter and two separate power cables for the Tesla V100 (same problem).
But when we try a GeForce RTX 2070, there is no such problem.
We want to solve this problem immediately, since we use this workstation for deep learning.
So can you suggest a way to solve it?

Our workstation specs:

  • Mainboard: ASUS C621E SAGE
  • Power supply: Seasonic PRIME 1300 Platinum - SSR-1300PD Full Platinum
  • CPU: 2x Intel Xeon Scalable Gold 6230 Processor
  • FAN: 2x NOCTUA NH-U12S DX-3647
  • RAM: 8x Samsung DDR4 32GB PC4-21300
  • SSD: Samsung 970 EVO Plus series 2TB M.2
  • Case: 3RSYS T1000
  • OS: Ubuntu 18.04
  • Driver Version: 455.45.01
  • Others: 24pin + 8 pin power cables, USB mouse, keyboard, monitor

Thank you.

The Tesla doesn’t have active cooling, it depends on the chassis to provide fans for the necessary airflow. If you’re using it in a workstation chassis, you’ll have to get some add-on fans for it.
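
If you do add fans, a simple way to verify the airflow is enough is to watch the idle temperature and power draw for a while (a minimal sketch using standard nvidia-smi query fields):

$ nvidia-smi --query-gpu=temperature.gpu,power.draw,pstate --format=csv -l 5    # prints a reading every 5 seconds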


Yeah, but what I don’t understand is why the GPU gets hotter even when it's doing nothing…?
Shouldn't it at least stay around normal/room temperature?
There are no processes running on the GPU (not even a GUI).

Please post the output of nvidia-smi to check idle power consumption.

Also make sure that you enabled nvidia-persistenced to start on boot and that it continuously runs.
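
On a systemd-based install like Ubuntu 18.04 the daemon is normally managed through the nvidia-persistenced service shipped with the driver package (a minimal sketch, assuming that unit is present):

$ systemctl status nvidia-persistenced              # check that the daemon is running
$ sudo systemctl enable --now nvidia-persistenced   # start it now and on every boot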

Yeah, we enabled nvidia-persistenced mode as you mentioned.

And the GPU isn't doing anything, but it just keeps getting hotter…

$ nvidia-smi:

Thu Dec 17 11:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   57C    P0    32W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Idle power consumption:

Pwr: Usage/Cap
32W / 250W

Is that right?

It seems the GPU power usage also keeps increasing after a few minutes.
After 15 minutes:
57C P0 32W / 250W → 82C P0 48W / 250W

$ nvidia-smi:

Thu Dec 17 12:01:11 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   82C    P0    48W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

There’s definitely something wrong; the GPUs are running at P0 constantly. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
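
For reference, a minimal invocation (run it from a directory you can write to; it drops the archive there):

$ sudo nvidia-bug-report.sh    # writes nvidia-bug-report.log.gz to the current directory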

BTW, currently we are testing on only one Tesla V100 GPU (not two).

nvidia-bug-report.log.gz (421.3 KB)
nvidia-bug-report.log (1.1 MB)

Not really anything visible, but I have a suspicion:
You previously had a 2080 with a monitor connected and the system configured to run an Xserver on it. The nvidia config is still there, so if you didn’t disable the Xserver/Desktop from starting, it will

  • start on the Tesla
  • find no monitor
  • stop

in a loop in fast succession, hammering the Tesla. Please check if that is the case, e.g. by reconfiguring X to use the onboard ASPEED graphics.
Furthermore, it doesn’t seem you’re using the persistence daemon (nvidia-persistenced) but the deprecated persistence mode. It shouldn't make a difference, but better to stay on the recommended setup.
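
A few ways to check, assuming a stock Ubuntu 18.04 setup (the service names may differ on other configurations):

$ systemctl status display-manager          # is a desktop/X server configured to start?
$ journalctl -b | grep -i -e xorg -e gdm    # look for X servers starting and stopping in a loop
$ systemctl status nvidia-persistenced      # confirm the daemon is used instead of legacy persistence mode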

Okay, somehow that makes sense. We will check that one.

And per your suggestion, we'll switch to the persistence daemon (nvidia-persistenced) - got it :)

Thank you for your responses.