NVIDIA DGX Station (V100): One GPU gets very hot after booting

Hello,

One of the GPUs in our DGX Station (V100) has suddenly started getting very hot, even when idle.
I disabled the GPU right after booting with the following commands:

nvidia-smi -i 00000000:0F:00.0 -pm 0
nvidia-smi drain -p 0000:0F:00.0 -m 1

However, after about 10 minutes the GPU temperature still rises above 80 °C, whereas the other 3 GPUs stay at 40-45 °C.
After some time, even nvidia-smi fails to get information from the GPU and reports the following error:
Unable to determine the device handle for GPU0000:0F:00.0: Unknown Error
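
For anyone who wants to log this comparison instead of watching nvidia-smi by hand, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and relies on its --query-gpu CSV output. The helper names (parse_temperatures, hot_outliers, read_temperatures) and the 20 °C margin are illustrative choices, not anything provided by NVIDIA.

```python
import csv
import io
import statistics
import subprocess

# Ask nvidia-smi for one CSV row per GPU: PCI bus id and core temperature.
QUERY = [
    "nvidia-smi",
    "--query-gpu=pci.bus_id,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Map PCI bus id -> temperature in degrees C from the CSV output."""
    temps = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # blank lines or error messages without a comma
        bus_id, temp = (field.strip() for field in row)
        try:
            temps[bus_id] = int(temp)
        except ValueError:
            continue  # non-numeric value, e.g. "[N/A]"
    return temps


def hot_outliers(temps, margin_c=20):
    """Return bus ids running more than margin_c above the median GPU.

    margin_c is an arbitrary illustrative threshold, not an NVIDIA limit.
    """
    if len(temps) < 2:
        return []
    median = statistics.median(temps.values())
    return sorted(bus for bus, t in temps.items() if t - median > margin_c)


def read_temperatures():
    """Run nvidia-smi once and parse its output (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

Calling hot_outliers(read_temperatures()) in a loop (for example once a minute) would show which GPU drifts away from the others. Once a GPU has fallen off the bus as described above, nvidia-smi may exit with an error, so the call should be wrapped in a try/except in practice.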

The kernel log is full of lines like these:

[ 1259.227028] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 0 to remote PCI:0000:0f:00
[ 1259.227047] NVRM: Xid (PCI:0000:07:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 1 to remote PCI:0000:0f:00
[ 1259.227219] NVRM: Xid (PCI:0000:08:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 2 to remote PCI:0000:0f:00
[ 1259.227294] NVRM: Xid (PCI:0000:0e:00): 74, pid='<unknown>', name=<unknown>, NVLink: Failed to train link 3 to remote PCI:0000:0f:00
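
To see at a glance which GPU the failing links point to, the Xid 74 lines can be tallied by their remote PCI address. The following is a rough Python sketch. The regular expression is written against the four lines quoted above only, so other driver versions may format the message differently, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Xid 74 / NVLink: Failed to train link" lines quoted in this post.
XID_74 = re.compile(
    r"NVRM: Xid \(PCI:(?P<local>[0-9a-fA-F:]+)\): 74,.*"
    r"NVLink: Failed to train link (?P<link>\d+) "
    r"to remote PCI:(?P<remote>[0-9a-fA-F:]+)"
)


def failed_links(log_text):
    """Return (local_gpu, link_number, remote_gpu) for each Xid 74 line."""
    events = []
    for line in log_text.splitlines():
        match = XID_74.search(line)
        if match:
            events.append(
                (match["local"].lower(), int(match["link"]), match["remote"].lower())
            )
    return events


def count_by_remote(events):
    """Count link-training failures per remote PCI address."""
    return Counter(remote for _local, _link, remote in events)
```

Fed the output of dmesg (or journalctl -k), count_by_remote(failed_links(text)) gives one counter entry per remote device. For the four lines above, every failure points at 0000:0f:00, the same GPU that overheats.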

I assume the GPU has simply died. Is there a way to remove it from the mainboard, or to disable the PCIe slot it is connected to?

It is not obvious to me how to remove one of the GPUs, because all 4 GPUs are stacked together somehow and I am afraid of causing even more damage.
I couldn’t find any kind of service guide that explains how to replace a faulty GPU.

Unfortunately, just unplugging the GPU's power connectors does not work, because the system refuses to boot if it detects a GPU in a PCIe slot without its power cables connected.

Any help is highly appreciated.

Best regards,
Daniel

This video helped me understand how the GPUs are attached to each other, and I was finally able to remove the faulty GPU :)

Now I am wondering whether I can simply detach the water cooling tubes, or would this damage the water cooling circuit?

Hi, can you share more about how you removed the GPU? Especially how to remove the NVLink bridge and the cooling tubes?

Hi, sure.

NVLink bridge:
After removing all the screws, you can simply pull the whole block horizontally towards you.
The bridge consists of a PCB with 4 PCIe-like connectors that connect all the GPUs together.
The NVLink bridge fits quite tightly, so you have to pull fairly hard.

Cooling tubes:
Put your fingers around the ribbed ring of each connector and pull it off from the right side of the PC case. No water will drain out, as the connectors have an integrated valve.

Hope this helps.

Do you have similar issues with your GPUs? Let me know if you find a way to fix or replace the GPUs. Our Station is running with only 2 GPUs left :( It's a pity that NVIDIA does not offer replacement GPUs.

Daniel

Hello,

Thanks a lot for all the info so far. It is helpful, as we currently have the same issue.

I was able to remove the bridge and unplug the watercooling system of our overheating V100, but I cannot remove the card.

  • I removed the screws on the right side of the card (to remove the vertical bar blocking the cables, and also the single screw attaching the card).
  • I removed the screw on the left attaching the bracket to the case.

Our faulty card is the one at the very bottom. It looks like there is a security latch on the PCIe slot, but I am not sure and I don't want to force anything without being sure…

Thanks in advance

Pierre

Indeed, I was able to detach the V100 card once the latch at the end of the PCIe slot was pushed.

Any clue as to why the card was overheating? (Thermal paste? Just aging?)

Best regards

FWIW, a quick update on that.

  • Even without any job running, the V100 cards would start at around 70 °C and then heat up until they switched off (thermal protection).
  • We removed all the cards and flushed the liquid coolant.
  • We disassembled all the cards completely (not too difficult, in fact).
  • Inside the copper block, everything was oxidized/clogged (so no fluid was flowing, hence the high temperature).
  • We cleaned it with a toothbrush and white vinegar (copper parts only :-) ), removing the rubber seals first.
  • We applied new thermal paste / thermal pads and refilled the liquid coolant.
  • All is fine now, at 26 °C when idle.

PGR

I also encountered a similar situation. Currently, out of the four GPUs, three heat up rapidly after booting, and within 10 minutes, they become non-functional. Your approach appears very thorough. May I ask which brand of coolant you used when refilling? If I disassemble and reassemble the system myself using screws, is there a risk of leakage? I would like to try it as well. Thank you very much!

Hi,

We used EKWB EK-CryoFuel Premix 1L watercooling coolant (the yellow one, but I guess any color is fine). That's the one recommended by NVIDIA for this setup (the clear one, but we could not order it).

Disassembling everything was not too hard (and this was our first experience with a watercooling system).

  • Removing the first card was a bit difficult (reaching the lock/latch at the end of the PCIe slot, since there is not much space).
  • We used a toothpick and patience to remove the rubber seals (also the ones on the side of the card). Please note where each one belongs (they are not exactly the same size).
  • If you do one or two cards at a time, you always have the others to double-check where each screw belongs.
  • Once a card is unplugged, disassembling it was easy (card first, then the cooling system).
  • You can use the tubes from a dismantled card, connect them again to the bottom plugs in the DGX Station and purge the system.
  • We tried to purge as much as possible, then put coolant back into the block.
  • Once a card was cleaned and assembled again, we connected it just to the watercooling system (not to the PCIe slot) and turned on the station. The pump started turning and pushed new coolant into the top radiators and into the card (we had to refill the block several times). Do this for each card, without going too high or too low on the indicators on the block/reservoir inside the station.
  • You can leave the cards on the side (not connected to the PCIe slot) while the pump/station is running for a while, to make sure there are no leaks.

Thank you so much for such a detailed reply—it really gives me a lot more confidence to try fixing things myself. I have one more question: what kind of solution did you use to clean the copper corrosion inside the GPU? I saw you mentioned using vinegar—was it just regular white vinegar?
