I have just built a new workstation PC that includes an NVIDIA V100 GPU (PCIe card), running Pop!_OS Linux. On turning the workstation on for the first time, the V100 card gets quite warm to the touch, despite (supposedly) not being used for any calculations. I thought I might have been supplying too much power via the two PCIe power cables, so I tried unplugging those and rebooting. However, even on PCIe slot power alone, the GPU still gets noticeably warm.
The issue seems like it may be similar to the one reported previously here:
That issue seemed to relate to the X Window System trying to use the V100 for graphics, despite the card not having any display outputs. So I tried booting straight into the Linux CLI, to ensure X is not started at all, but I still see the same gradual heating of the GPU card at idle.
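For reference, this is roughly how I'm booting into the CLI (a sketch of what I did, assuming the usual systemd targets; Pop!_OS may differ slightly):

    # One-off: append this to the kernel command line from the boot menu
    systemd.unit=multi-user.target

    # Or make the text console the default until changed back:
    sudo systemctl set-default multi-user.target
    sudo reboot

    # Afterwards, confirm no X server is running:
    ps aux | grep -i [x]org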
Can you please advise on how I can troubleshoot this?
Another thing to mention: the V100 is listed in lspci; however, when I run nvidia-smi, it reports that no devices were found.
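In case it helps, these are roughly the checks I'm running (hedged; I've omitted the exact output):

    lspci | grep -i nvidia     # the V100 shows up here
    nvidia-smi                 # reports that no devices were found
    lsmod | grep nvidia        # is the nvidia kernel module actually loaded?
    dmesg | grep -i nvrm       # any driver errors at boot?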
According to a reply in that thread, the V100 only has performance state P0, which is the highest, so a fairly high power draw should apparently be expected even when idle.
Is there any documentation for this card that clearly explains its usage, performance levels, etc.? I haven't used a passively-cooled GPU before, and I would be very interested to know things like 'how hot is too hot?' and 'what sort of cooling setup is recommended?'. At what temperature will I start to see a loss of performance due to throttling?
Run nvidia-smi -q. It will spit out most of the info you're looking for.
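For example (a sketch; the -d filters below are the ones I'd expect to be most relevant here):

    nvidia-smi -q                                    # full report for every GPU
    nvidia-smi -q -d TEMPERATURE,POWER,PERFORMANCE   # just the sections relevant to this thread
    nvidia-smi -q -d CLOCK                           # current clocks vs. max clocks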
I've never used this GPU so I can't give specifics for it, but being in performance state 0 doesn't necessarily mean it's running at its highest clocks. At least on a GeForce card, the graphics clock can drop down to, or close to, the clocks of a higher-numbered (lower-performance) state while still reporting P0.
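If you want to see whether the clocks really are dropping even while it reports P0, a query along these lines should show it (field names taken from nvidia-smi --help-query-gpu):

    nvidia-smi --query-gpu=pstate,clocks.sm,clocks.max.sm,power.draw --format=csv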
As for temps, I would say anything near 85C is a bit too hot for my taste. The driver will probably let it get to 95C+.
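Rather than taking my numbers, you can check the thresholds your particular card reports (a sketch; exact field names vary a little between driver versions):

    nvidia-smi -q -d TEMPERATURE
    # Look for "GPU Slowdown Temp" (where throttling starts) and "GPU Shutdown Temp";
    # newer drivers also report "GPU Max Operating Temp".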
Thanks for your reply. I have checked, and nvidia-persistenced is installed and configured to start on boot (as a systemd service). It was set up to disallow persistence mode, so I changed that in the .service file, and nvidia-smi now shows the GPU is in persistence mode. However, it still seems to be stuck in the P0 state and I'm seeing the same issue with the idle temperature creeping up.
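In case it's useful to anyone else, this is roughly what I changed (a sketch; the service file path and original ExecStart line may differ on Pop!_OS):

    # The nvidia-persistenced.service unit was starting the daemon with the
    # --no-persistence-mode flag; I removed that flag from the ExecStart line, then:
    sudo systemctl daemon-reload
    sudo systemctl restart nvidia-persistenced

    # Confirm the change took effect:
    nvidia-smi --query-gpu=persistence_mode,pstate --format=csv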
So, you’re saying the reply in that other thread I linked is incorrect? (the one that claimed the V100 only has a P0 state and can’t access the lower states)
Yes, it seems I will have to give some more thought to the cooling situation, as I haven't used a passively-cooled card like this before and I am quite surprised at a) how much power is being drawn (and dissipated as heat) at idle and b) how fast the temperature is rising at a little over 10% of the maximum power.
Right now, I have had it idle for about 20 minutes - it's at 80C, and for some reason the power draw seems to be increasing as the temperature increases.
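For reference, I'm watching it with something like this (field names from nvidia-smi --help-query-gpu):

    # Log performance state, power draw and temperature every 10 seconds
    nvidia-smi --query-gpu=timestamp,pstate,power.draw,temperature.gpu --format=csv -l 10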
What I am thinking of doing is getting a high-speed case fan and 3D-printing some adapters so I can duct the output of that fan directly through the V100 card, which will hopefully help with the cooling.
Also, can I confirm what the power connections are supposed to be? For the adapter that goes from 2x 8-pin PCIe to the single 8-pin CPU (EPS) input on the card, should that be fed by two separate PCIe power cables, or by a single daisy-chained PCIe cable with both of its connectors going into the adapter?
Ok, thanks again for your advice. I will try that.
FWIW, I did some further investigation and the P0 locking seems to be particular to the V100 card. We have a GTX 1080 Ti in an older workstation in our lab; I swapped the V100 in the new workstation out for it, and straight away the 1080 Ti drops down to P8 and about 6 W at idle. Curiously, the output of nvidia-smi -q shows 'Display Mode' as 'Inactive' for the 1080 Ti, but as 'Active' for the V100 (even in CLI-only Linux, which doesn't really make sense).
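For anyone curious, the side-by-side check was roughly this, run on each machine with its respective card (a sketch):

    nvidia-smi --query-gpu=name,pstate,power.draw,display_mode,display_active --format=csv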
Thanks for that link - I hadn't seen it and it is very helpful. I have ordered a high-speed PWM fan, which I plan to duct straight onto the intake of the V100 card, as well as a thermal controller to measure the card temperature directly. Based on that document, it looks like the airflow should be sufficient (fingers crossed!).