Hello,
One of the GPUs in our DGX Station (V100) has suddenly started running very hot, even when idle.
I disabled the GPU right after booting with the following commands:
nvidia-smi -i 00000000:0F:00.0 -pm 0
nvidia-smi drain -p 0000:0F:00.0 -m 1
However, after about 10 minutes the temperature of that GPU still rises above 80 °C, whereas the other three GPUs stay at 40-45 °C.
After some time, nvidia-smi even fails to get information from the GPU and reports the following error:
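For reference, a minimal sketch of how such temperature readings can be checked per GPU. It assumes nvidia-smi can still query the devices; the function name `flag_hot_gpus` and the default threshold of 80 are made up for illustration.

```shell
# Hypothetical helper (not part of nvidia-smi): reads CSV lines of the form
#   index, pci.bus_id, temperature.gpu
# on stdin and prints the bus ID and temperature of every GPU at or above
# the given limit in °C (default 80, an arbitrary choice).
flag_hot_gpus() {
    limit="${1:-80}"
    # Fields are separated by ", "; non-numeric readings such as [N/A]
    # evaluate to 0 and are therefore not flagged.
    awk -F', *' -v limit="$limit" '$3 + 0 >= limit { print $2, $3 }'
}
```

It could be fed like this:

nvidia-smi --query-gpu=index,pci.bus_id,temperature.gpu --format=csv,noheader,nounits | flag_hot_gpus 80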
Unable to determine the device handle for GPU 0000:0F:00.0: Unknown Error
The kernel log is full of lines like the following:
[ 1259.227028] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 0 to remote PCI:0000:0f:00
[ 1259.227047] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 1 to remote PCI:0000:0f:00
[ 1259.227219] NVRM: Xid (PCI:0000:08:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 2 to remote PCI:0000:0f:00
[ 1259.227294] NVRM: Xid (PCI:0000:0e:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 3 to remote PCI:0000:0f:00
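To see at a glance which device these entries point at, the log can be summarized per remote PCI address. This is only a rough sketch that assumes the exact line format shown above; the function name `summarize_xid74` is made up for illustration.

```shell
# Hypothetical helper: reads kernel log text on stdin and counts
# Xid 74 (NVLink) entries per remote PCI address.
# Output format: "<remote PCI address> <number of entries>"
summarize_xid74() {
    # Keep only Xid 74 lines and extract the address after "to remote PCI:".
    sed -n 's/.*NVRM: Xid ([^)]*): 74,.*to remote PCI:\([0-9A-Fa-f:]*\).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

It could be used as `dmesg | summarize_xid74` or `journalctl -k | summarize_xid74`. If every entry names the same remote address, that suggests the other GPUs are all failing to reach that one device over NVLink.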
I assume the GPU has simply died. Is there a way to remove it from the mainboard, or to disable the PCIe slot it is connected to?
It is not obvious to me how to remove one of the GPUs, because all four are stacked together somehow, and I am afraid of causing even more damage.
I couldn't find any kind of service guide that explains how to replace a faulty GPU.
Unfortunately, just disconnecting the GPU's power connectors does not work, because the system refuses to boot if it detects a GPU in a PCIe slot without its power cables connected.
Any help is highly appreciated.
Best regards,
Daniel