Hello,
Suddenly, one of the GPUs of our DGX Station (V100) gets very hot even when idle.
I disabled the GPU right after booting with the following commands:
nvidia-smi -i 00000000:0F:00.0 -pm 0
nvidia-smi drain -p 0000:0F:00.0 -m 1
However, after about 10 minutes the GPU temperature still rises above 80 °C, while the other three GPUs stay at 40-45 °C.
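For anyone who wants to watch the temperatures side by side, nvidia-smi's query interface can poll all GPUs continuously (a generic sketch using standard query fields, nothing DGX-specific):

```shell
# Print index, PCI bus ID and core temperature of every GPU,
# refreshed every 5 seconds, in CSV form.
nvidia-smi --query-gpu=index,pci.bus_id,temperature.gpu --format=csv -l 5
```

A single faulty card should stand out immediately in this output.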
After some time, even nvidia-smi fails to get information from the GPU and reports the following error:
Unable to determine the device handle for GPU0000:0F:00.0: Unknown Error
The kernel log is full of lines like the following:
[ 1259.227028] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 0 to remote PCI:0000:0f:00
[ 1259.227047] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 1 to remote PCI:0000:0f:00
[ 1259.227219] NVRM: Xid (PCI:0000:08:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 2 to remote PCI:0000:0f:00
[ 1259.227294] NVRM: Xid (PCI:0000:0e:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 3 to remote PCI:0000:0f:00
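Xid 74 is documented by NVIDIA as an NVLink error, which matches the failed link training to 0000:0f:00 above. To watch for new Xid events live while the machine runs (assuming a systemd-based distro with journalctl, as on DGX OS):

```shell
# Follow the kernel ring buffer and surface NVRM Xid events as they occur.
journalctl -k -f | grep --line-buffered -i 'xid'
```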
I assume the GPU has simply died. Is there a way to remove it from the mainboard, or a way to disable the PCIe slot it is connected to?
It is not obvious to me how to remove one of the GPUs, because all four GPUs are stacked together somehow, and I am afraid of causing even more damage.
I couldn't find any kind of service guide that explains how to replace a faulty GPU.
Unfortunately, simply unplugging the GPU's power connectors does not work: the system refuses to boot if it detects a GPU in a PCIe slot whose power cables are not connected.
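On the software side, Linux can logically detach a PCI device via sysfs so the driver no longer touches it. This is a generic kernel mechanism, not something I have verified on a DGX Station, and it will not stop the card from heating if the cooling loop itself is blocked:

```shell
# Hypothetical workaround (run as root): hide the faulty GPU
# (bus ID 0000:0f:00.0, taken from the kernel log above) from the PCI tree.
echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove

# To make the device visible again later (or simply reboot):
echo 1 > /sys/bus/pci/rescan
```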
Any help is highly appreciated.
Best regards,
Daniel
This video helped me understand how the GPUs are attached to each other, and I was finally able to remove the faulty GPU :)
Now I am wondering: can I simply detach the water-cooling tubes, or would this damage the water-cooling circuit?
Hi, can you share more about how you removed one GPU? Especially how to remove the NVLink bridge and the cooling tubes?
Hi, sure.
NVLink bridge:
After removing all the screws, you can simply pull the whole block horizontally towards you.
The bridge consists of a PCB with four PCIe-like connectors that link all the GPUs together.
The NVLink bridge fits quite tightly, so you have to pull a little harder.
Cooling tubes:
Put your fingers around the ribbed ring of the connectors and pull them out from the right side of the PC case. No water will drain out, as the connectors have an integrated valve.
Hope this helps.
Do you have similar issues with your GPUs? Let me know if you find a solution to fix or replace them. Our Station is running with only two GPUs left :( A pity that NVIDIA does not offer replacement GPUs.
Daniel
Hello,
Thanks a lot for all the info already. This is helpful as we have the same issue currently.
I was able to remove the bridge and unplug the water-cooling tubes of our overheating V100, but I cannot remove the card:
- I removed the screws on the right side of the card (to free the vertical bar blocking the cables) and also the single screw attaching the card
- I removed the screw on the left attaching the bracket to the case
Our faulty card is the one at the very bottom. It looks like there is a retention latch on the PCIe slot, but I am not sure, and I don't want to force anything without being sure…
Thanks in advance
Pierre
Indeed, I was able to detach the V100 card once the latch at the end of the PCIe slot was pushed.
Any clue why the card was overheating? (Thermal paste? Just aging?)
Best regards
FWIW, a quick update on that.
- Even without any job running, the V100s would start around 70 °C and then heat up until the thermal protection switched them off
- We removed all the cards and flushed the liquid coolant
- Disassembled all the cards completely (not too difficult, in fact)
- Inside the copper block, everything was oxidized/clogged (so no fluid was flowing, hence the high temperatures)
- Cleaned it up with a toothbrush and white vinegar (copper part only :-) ); we removed the rubber seals before doing that
- Applied new thermal paste/thermal pads and refilled the liquid coolant
- All is fine now: 26 °C when idle
PGR
I also encountered a similar situation. Currently, three of our four GPUs heat up rapidly after booting and become non-functional within 10 minutes. Your approach appears very thorough. May I ask which brand of coolant you used when refilling? If I disassemble and reassemble the system myself, is there a risk of leakage? I would like to try it as well. Thank you very much!