DGX Workstation V100 overheats one of the GPUs


We have started running into overheating issues with GPU3 on a DGX V100 workstation, which has a liquid-cooled stack of 4x Tesla V100-DGXS-32GB GPUs connected via NVLink.

Even when completely idle, GPU3 heat-soaks to 80+ C and draws 53 W, while the other GPUs sit at 32-33 C and 38 W.

Is there a way to at least disable the faulty GPU?

nvidia-bug-report.log.gz (2.0 MB)

The following commands disable a GPU and make it invisible, so that it no longer appears in the list of CUDA devices. A disabled GPU does not take up a device index: if you had 4 GPUs (0, 1, 2, 3) and disabled GPU2, you would see 0, 1, 2 instead of 0, 1, 3.

nvidia-smi -i 0000:xx:00.0 -pm 0
nvidia-smi drain -p 0000:xx:00.0 -m 1
where xx is the PCI bus number of your GPU (the middle field of its domain:bus:device.function address). You can find it with lspci | grep NVIDIA or nvidia-smi.

The device will still be visible with lspci after running the commands above.
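
To find the PCI address of the hot GPU without reading the full nvidia-smi table, something like the sketch below might help. It assumes nvidia-smi is on the PATH and uses its --query-gpu CSV output. The hot_gpus helper and the 80 C threshold are made up for illustration and are not part of any NVIDIA tool. It only prints addresses and does not drain anything.

```shell
#!/bin/sh
# Hypothetical helper: reads "pci.bus_id, temperature.gpu" CSV lines on stdin
# and prints the PCI bus ID of every GPU at or above the given temperature (C).
hot_gpus() {
    threshold="$1"
    awk -F', *' -v t="$threshold" '$2 + 0 >= t { print $1 }'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    # noheader,nounits leaves plain lines: one bus ID and one number per GPU.
    nvidia-smi --query-gpu=pci.bus_id,temperature.gpu \
               --format=csv,noheader,nounits | hot_gpus 80
fi
```

The printed address is what goes after -i and -p in the two commands above. nvidia-smi may print the domain part with eight digits rather than four, so check that the address is accepted in that form before relying on it. According to the linked Stack Exchange answer, running the drain command again with -m 0 brings the GPU back.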

Source: linux - How can I disable (and later re-enable) one of my NVIDIA GPUs? - Unix & Linux Stack Exchange