V100 GPU on new workstation getting very warm when idle

Hi,

I have just built a new workstation PC that includes an NVIDIA V100 GPU (PCIe card), running Pop!_OS Linux. On turning the workstation on for the first time, the V100 card gets quite warm to the touch, despite (supposedly) not being used for any calculations. I thought I might have been supplying too much power via the two PCIe power cables, so I tried unplugging those and rebooting. However, even on just the PCIe slot power, the GPU still gets noticeably warm.

The issue seems like it may be similar to the one reported previously here:

https://forums.developer.nvidia.com/t/tesla-v100-gpu-thermal-causing-shutdown-even-its-doing-nothing/163791/10

That seemed to relate to the X Window System trying to use the V100 for graphics, despite the card having no graphical output. So, I tried booting straight into the Linux CLI, to ensure X is not started at all, but I still see the same gradual heating of the GPU card at idle.
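In case it's relevant, the way I disabled the graphical session was roughly as follows (assuming a systemd-based setup, as on Pop!_OS):

# boot into a text console instead of the graphical session
sudo systemctl set-default multi-user.target
sudo reboot

# to switch back to the desktop later
sudo systemctl set-default graphical.target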

Can you please advise on how I can troubleshoot this?

Another thing to mention is that the V100 GPU is listed in lspci; however, when I run nvidia-smi, it reports that no devices were found.
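These are the checks I used to confirm the card is visible on the bus and to see which kernel driver is bound to it (standard commands, nothing V100-specific):

# list NVIDIA devices and the kernel driver currently in use for each
lspci -k -d 10de:

# look for driver initialisation errors
sudo dmesg | grep -iE 'nvrm|nvidia'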

Thanks in advance!

Attaching the nvidia-bug-report.log file.
nvidia-bug-report.log.gz (479.7 KB)

Also, I tried attaching just one of the PCIe power cables to the card, and now I do get output from nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:04:00.0 Off |                    0 |
| N/A   48C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The card seems to be running at the P0 level (as in the previous example). It's drawing 27W and the temperature is steadily increasing.
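To keep an eye on it, I'm just polling nvidia-smi in a loop, something like:

# log temperature, power draw and performance state every 5 seconds
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,pstate --format=csv -l 5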

(btw, how hot is this going to be, when it’s running at 250W?!)

Ok, so according to the answer here:

https://forums.developer.nvidia.com/t/why-the-pstate-of-v100-does-not-change-always-p0/116746/3

the V100 only has performance state P0, which is the highest, so a fairly high power draw should apparently be expected even when idle.

Is there any documentation anywhere for this card that clearly explains usage, performance levels, etc.? I haven't used a passively-cooled GPU before, and I would be very interested to know things like 'how hot is too hot?' and 'what sort of cooling setup is recommended?'. At what temperature will I start to see a loss of performance due to throttling?

Do nvidia-smi -q. It will spit out the majority of the info you’re looking for.
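For the temperature limits specifically, you can narrow the query down, e.g.:

# shows the slowdown (throttle) and shutdown temperature thresholds,
# plus the power limits (exact field names may vary by driver version)
nvidia-smi -q -d TEMPERATURE,POWER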

I've never used this GPU so I can't give specifics for it, but being in performance state 0 doesn't necessarily mean it's running at its highest graphics clocks. At least on a GeForce, the graphics clock can drop down to, or near, the levels of a higher-numbered (lower-power) performance state.

As for temps, I would say anything near 85°C is a bit too hot for my taste. The driver will probably let it get to 95°C or more.


First of all, please read this:
https://forums.developer.nvidia.com/t/nvidia-gpus-on-ubuntu-22-04-lts-one-gpu-keeps-disappearing-after-installing-nvidia-driver/289770/6
Second, please configure nvidia-persistenced to start on boot for the gpu to reach P8 while idle.
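Something along these lines should do it, assuming the packaged systemd unit is present:

# check whether the service exists, then enable it at boot and start it now
systemctl status nvidia-persistenced
sudo systemctl enable --now nvidia-persistenced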


@generix

Thanks for your reply. I have checked, and nvidia-persistenced is installed and configured to start on boot (as a systemd service). It seemed to be set up to disallow persistence mode, so I have changed that in the .service file, and nvidia-smi now shows that the GPU is in persistence mode. However, it still seems to be stuck in the P0 state, and I'm seeing the same issue with the idle temperature creeping up.
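For reference, the change I made was roughly the following. On my system the packaged unit was passing --no-persistence-mode to the daemon, and a systemd drop-in seemed the cleanest way to override that (the exact binary path and remaining flags are just what my unit happened to use, so check yours with systemctl cat nvidia-persistenced):

# open a drop-in override for the packaged unit
sudo systemctl edit nvidia-persistenced

# in the drop-in, clear ExecStart and re-declare it without --no-persistence-mode,
# keeping whatever other options the original unit already passes
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --verbose

# then reload and restart
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced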

So, you’re saying the reply in that other thread I linked is incorrect? (the one that claimed the V100 only has a P0 state and can’t access the lower states)

@BlueGoliath

Yes, it seems I will have to give some more thought to the cooling situation, as I haven't used a passively-cooled card like this before, and I am quite surprised at a) how much power is being drawn at idle and b) how fast the temperature is rising at a little over 10% of the maximum power.

Right now, I have had it at idle for about 20 minutes - it's at 80°C, and for some reason the power draw seems to be increasing as the temperature increases.

What I am thinking of doing is getting a high-speed case fan and 3D-printing some adapters so I can duct the output of that fan directly through the V100 card, which will hopefully help with the cooling.

Also, can I confirm what the power connections are supposed to be? For the adapter that goes from 2x PCIe connectors to the 1x CPU-style input on the card, should that use two separate PCIe power cables, or a single daisy-chained PCIe cable with both of the daisy-chained connectors going into the adapter?

Generally it's recommended that you use two separate cables, but with a good power supply it should be fine with a daisy-chained cable.

More power = higher temps, all else being equal.

Yes, do that. If noise is an issue, I think Noctua sells 80mm fans if that’s the size you need.

Ok, thanks again for your advice. I will try that.

Fwiw, I did some further investigation, and the P0-locking seems to be particular to the V100 card. We have a GTX 1080 Ti in an older workstation in our lab. I swapped the V100 in the new workstation out for that, and straight away the 1080 Ti drops down to P8, at 6W on idle. Curiously, the output of nvidia-smi -q shows that for the 1080 Ti, 'Display Mode' is 'Inactive'; however, for the V100, 'Display Mode' shows as 'Active' (even in a CLI-only Linux session, which doesn't really make sense).
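(For anyone comparing, the relevant fields can also be queried directly, e.g.:)

# compare display mode, performance state and power draw between the cards
nvidia-smi --query-gpu=name,display_mode,display_active,pstate,power.draw --format=csv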

Yes, normal GeForce cards have multiple performance states. NVML bugs are aplenty, but it's generally good.


Passive Tesla cards are designed to be used in enclosures that ensure adequate airflow is provided.

I cannot locate the data sheet for the V100, but the document for the older P100 Tesla is here.

The P100 is the same form factor and is also a 250W card, so the cooling requirement should be the same and ducting is probably going to be required.

Thanks for that link - I hadn't seen that, and it is very helpful. I have ordered a high-speed PWM fan, which I plan to duct straight onto the intake of the V100 card, as well as a thermal controller to directly measure the card temperature. Based on that document, it looks like the airflow from that fan should be sufficient (fingers crossed!).