I have just built a new workstation PC that includes an NVIDIA V100 GPU (PCIe card), running Pop!_OS Linux. On turning the workstation on for the first time, the V100 card gets quite warm to the touch, despite (supposedly) not being used for any calculations. I thought I might have been supplying too much power via the two PCIe power cables, so I tried unplugging those and rebooting. However, even on PCIe slot power alone, the GPU still gets noticeably warm.
The issue seems like it may be similar to the one reported previously here:
That issue seemed to relate to the X Window System trying to use the V100 for graphics, despite the card not having any display outputs. So I tried booting straight into the Linux CLI, to ensure X is not started at all, but I still see the same gradual heating of the GPU card at idle.
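For reference, this is roughly how I'm booting into the CLI (a sketch of what I did, assuming the usual systemd targets; Pop!_OS may differ slightly):

    # One-off: append this to the kernel command line from the boot menu
    systemd.unit=multi-user.target

    # Or make the text console the default until changed back:
    sudo systemctl set-default multi-user.target
    sudo reboot

    # Afterwards, confirm no X server is running:
    ps aux | grep -i [x]org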
Can you please advise on how I can troubleshoot this?
Another thing to mention: the V100 is listed in lspci; however, when I run nvidia-smi, it reports that no devices were found.
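In case it helps, these are roughly the checks I'm running (hedged; I've omitted the exact output):

    lspci | grep -i nvidia     # the V100 shows up here
    nvidia-smi                 # reports that no devices were found
    lsmod | grep nvidia        # is the nvidia kernel module actually loaded?
    dmesg | grep -i nvrm       # any driver errors at boot?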
According to a reply in that thread, the V100 only has performance state P0, which is the highest, so a fairly high power draw should apparently be expected even when idle.
Is there any documentation for this card that clearly explains its usage, performance levels, etc.? I haven't used a passively-cooled GPU before, and I would be very interested to know things like 'how hot is too hot?' and 'what sort of cooling setup is recommended?'. At what temperature will I start to see a loss of performance due to throttling?
Run nvidia-smi -q. It will spit out most of the info you're looking for.
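For example (a sketch; the -d filters below are the ones I'd expect to be most relevant here):

    nvidia-smi -q                                    # full report for every GPU
    nvidia-smi -q -d TEMPERATURE,POWER,PERFORMANCE   # just the sections relevant to this thread
    nvidia-smi -q -d CLOCK                           # current clocks vs. max clocks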
I've never used this GPU so I can't give specifics for it, but being in performance state 0 doesn't necessarily mean it's running at its highest clocks. At least on a GeForce card, the graphics clock can drop down to, or close to, the clocks of a higher-numbered (lower-performance) state while still reporting P0.
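If you want to see whether the clocks really are dropping even while it reports P0, a query along these lines should show it (field names taken from nvidia-smi --help-query-gpu):

    nvidia-smi --query-gpu=pstate,clocks.sm,clocks.max.sm,power.draw --format=csv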
As for temps, I would say anything near 85C is a bit too hot for my taste. The driver will probably let it get to 95C+.
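Rather than taking my numbers, you can check the thresholds your particular card reports (a sketch; exact field names vary a little between driver versions):

    nvidia-smi -q -d TEMPERATURE
    # Look for "GPU Slowdown Temp" (where throttling starts) and "GPU Shutdown Temp";
    # newer drivers also report "GPU Max Operating Temp".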
Thanks for your reply. I have checked, and nvidia-persistenced is installed and configured to start on boot (as a systemd service). It was set up to disallow persistence mode, so I changed that in the .service file, and nvidia-smi now shows the GPU is in persistence mode. However, it still seems to be stuck in the P0 state and I'm seeing the same issue with the idle temperature creeping up.
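In case it's useful to anyone else, this is roughly what I changed (a sketch; the service file path and original ExecStart line may differ on Pop!_OS):

    # The nvidia-persistenced.service unit was starting the daemon with the
    # --no-persistence-mode flag; I removed that flag from the ExecStart line, then:
    sudo systemctl daemon-reload
    sudo systemctl restart nvidia-persistenced

    # Confirm the change took effect:
    nvidia-smi --query-gpu=persistence_mode,pstate --format=csv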
So, you’re saying the reply in that other thread I linked is incorrect? (the one that claimed the V100 only has a P0 state and can’t access the lower states)
Yes, it seems I will have to give some more thought to the cooling situation, as I haven't used a passively-cooled card like this before and I am quite surprised at a) how much power is being drawn (and dissipated as heat) at idle and b) how fast the temperature is rising at a little over 10% of the maximum power.
Right now, I have had it idle for about 20 minutes - it's at 80C, and for some reason the power draw seems to be increasing as the temperature increases.
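For reference, I'm watching it with something like this (field names from nvidia-smi --help-query-gpu):

    # Log performance state, power draw and temperature every 10 seconds
    nvidia-smi --query-gpu=timestamp,pstate,power.draw,temperature.gpu --format=csv -l 10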
What I am thinking of doing is getting a high-speed case fan and 3D-printing some adapters so I can duct the output of that fan directly through the V100 card, which will hopefully help with the cooling.
Also, can I confirm what the power connections are supposed to be? For the adapter that goes from 2x 8-pin PCIe to the single 8-pin CPU (EPS) input on the card, should that be fed by two separate PCIe power cables, or by a single daisy-chained PCIe cable with both of its connectors going into the adapter?
Ok, thanks again for your advice. I will try that.
FWIW, I did some further investigation and the P0 locking seems to be particular to the V100 card. We have a GTX 1080 Ti in an older workstation in our lab; I swapped the V100 in the new workstation out for it, and straight away the 1080 Ti drops down to P8 and about 6 W at idle. Curiously, the output of nvidia-smi -q shows 'Display Mode' as 'Inactive' for the 1080 Ti, but as 'Active' for the V100 (even in CLI-only Linux, which doesn't really make sense).
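For anyone curious, the side-by-side check was roughly this, run on each machine with its respective card (a sketch):

    nvidia-smi --query-gpu=name,pstate,power.draw,display_mode,display_active --format=csv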
Thanks for that link - I hadn't seen it and it is very helpful. I have ordered a high-speed PWM fan, which I plan to duct straight onto the intake of the V100 card, as well as a thermal controller to measure the card temperature directly. Based on that document, it looks like the airflow should be sufficient (fingers crossed!).