Hi,
We have started running into overheating issues with GPU3 on a DGX V100 workstation that has a liquid-cooled stack of four NVLink-connected Tesla V100-DGXS-32GB GPUs.
At complete idle, GPU3 heat-soaks to 80+ °C with a 53 W power draw, while the other GPUs sit at 32-33 °C and 38 W.
Is there a way to at least disable the faulty GPU?
nvidia-bug-report.log.gz (2.0 MB)
The following commands disable a GPU, making it invisible to CUDA: it no longer appears in the list of CUDA devices, and it doesn't take up a device index. For example, if you had four GPUs 0,1,2,3 and disabled GPU2, the remaining devices would be enumerated as 0,1,2 instead of 0,1,3.
nvidia-smi -i 0000:xx:00.0 -pm 0
nvidia-smi drain -p 0000:xx:00.0 -m 1
where 0000:xx:00.0 is the PCI bus address of your GPU. You can determine it using lspci | grep NVIDIA or nvidia-smi.
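To illustrate the lookup step, either of these prints the bus address (the query fields shown are standard nvidia-smi options):

```shell
# List NVIDIA devices with their PCI bus addresses
lspci | grep -i nvidia

# Or map nvidia-smi device indices to full PCI bus IDs
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
```

The second form is handy here because it shows which index (e.g. GPU3) corresponds to which bus address.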
The device will still be visible with lspci after running the commands above.
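Putting it together, a sketch of the full disable/re-enable cycle (the bus address below is hypothetical; substitute the one for your GPU3):

```shell
# Hypothetical PCI bus address -- replace with your GPU3's actual address
BUSID=0000:0e:00.0

# Disable persistence mode, then mark the GPU as draining (disabled)
sudo nvidia-smi -i "$BUSID" -pm 0
sudo nvidia-smi drain -p "$BUSID" -m 1

# Later, to re-enable the GPU, reverse the two steps
sudo nvidia-smi drain -p "$BUSID" -m 0
sudo nvidia-smi -i "$BUSID" -pm 1
```

Note that draining only hides the GPU from CUDA applications; a heat-soaked board still draws power, so this is a workaround rather than a fix for the cooling fault.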
Source: linux - How can I disable (and later re-enable) one of my NVIDIA GPUs? - Unix & Linux Stack Exchange