NVIDIA DGX Station (V100): One GPU gets very hot after booting

Hello,

Suddenly, one of the GPUs in our DGX Station (V100) gets very hot, even when idling.
I disabled the GPU right after booting with the following commands:

nvidia-smi -i 00000000:0F:00.0 -pm 0
nvidia-smi drain -p 0000:0F:00.0 -m 1

However, after about 10 minutes the GPU temperature still rises above 80 °C, while the other three GPUs stay at 40–45 °C.
After some time, even nvidia-smi fails to query the GPU and reports:
Unable to determine the device handle for GPU0000:0F:00.0: Unknown Error
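A quick way to keep an eye on the temperatures is to filter the csv query output of nvidia-smi (just a sketch; `hot_gpus` and the 80 °C threshold are illustrative names, not anything official):

```shell
# Flag GPUs above a threshold, from nvidia-smi csv output, e.g.:
#   nvidia-smi --query-gpu=index,pci.bus_id,temperature.gpu --format=csv,noheader | hot_gpus 80
hot_gpus() {
  # fields: index, PCI bus id, temperature; $3+0 forces numeric comparison
  awk -F', ' -v max="$1" '$3+0 > max {print "HOT:", $0}'
}
```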

The kernel log is full of lines like the following:

[ 1259.227028] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 0 to remote PCI:0000:0f:00
[ 1259.227047] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 1 to remote PCI:0000:0f:00
[ 1259.227219] NVRM: Xid (PCI:0000:08:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 2 to remote PCI:0000:0f:00
[ 1259.227294] NVRM: Xid (PCI:0000:0e:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 3 to remote PCI:0000:0f:00
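To see how these failures are distributed across the GPUs, the Xid 74 lines can be summarised with a small filter (a sketch; `xid74_sources` is a name I made up):

```shell
# Count NVLink Xid 74 errors per reporting GPU, from kernel-log lines
# on stdin, e.g.:  dmesg | xid74_sources
xid74_sources() {
  # $5 is the "(PCI:dddd:bb:dd):" token; strip the wrapper, then tally
  awk '/NVRM: Xid/ && /: 74,/ {
    sub(/^\(PCI:/, "", $5); sub(/\):$/, "", $5); print $5
  }' | sort | uniq -c
}
```

In the log above, every link that fails to train points at the same remote device, 0000:0f:00 — consistent with that one GPU being the faulty part.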

I assume the GPU has died. Is there a way to remove it from the mainboard, or to disable the PCIe slot it is connected to?

It is not obvious to me how to remove one of the GPUs, because all four GPUs are stacked together somehow, and I am afraid of causing even more damage.
I couldn't find any service guide that explains how to replace a faulty GPU.

Unfortunately, simply unplugging the GPU's power connectors does not work: the system refuses to boot if it detects a GPU in a PCIe slot without its power cables connected.
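On the software side, Linux can at least detach a device from the PCI tree via sysfs, which stops the driver from touching it until the next rescan or reboot (a sketch only; `pci_remove` is my own helper name, it needs root, and I have not verified this on the DGX Station itself):

```shell
# Hot-remove a PCI device through sysfs, e.g.: pci_remove 0000:0f:00.0
# The address is the one from the Xid messages in the kernel log.
pci_remove() {
  dev="/sys/bus/pci/devices/$1/remove"
  if [ -w "$dev" ]; then
    echo 1 > "$dev"        # device vanishes until a PCI rescan or reboot
  else
    echo "cannot remove $1 (device not present or not root)" >&2
    return 1
  fi
}
```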

Any help is highly appreciated.

Best regards,
Daniel

This video helped me understand how the GPUs are attached, and I was finally able to remove the faulty GPU :)

Now I am wondering whether I can simply detach the water-cooling tubes, or whether this would damage the cooling circuit.

Hi, can you share more about how you removed the GPU? In particular, how did you remove the NVLink bridge and the cooling tubes?

Hi, sure.

NVLink bridge:
After removing all screws, you can simply pull the whole block horizontally towards you.
The bridge consists of a PCB with four PCIe-like connectors that connect all the GPUs together.
It fits quite tightly, so you have to pull a bit harder.

Cooling tubes:
Put your fingers around the ribbed ring of each connector and pull it out toward the right side of the case. No water drains out, as the connectors have integrated valves.

Hope this helps.

Do you have similar issues with your GPUs? Let me know if you have found a way to fix or replace them. Our Station is running with only two GPUs left :( It's a pity that NVIDIA does not offer replacement GPUs.

Daniel