Our DGX has recently been shutting down on its own. In the kernel log we found “core temperature above threshold” messages for the CPUs (see figure 1). We opened up the DGX and found that the coolant tank for the GPUs was running low. However, the CPU coolant appears to be stored only inside the coolant pipes, isolated from the GPU coolant tank, and we couldn’t find any indicator of how much CPU coolant is left. According to the DGX documentation, the coolant kit only refills the coolant tank, which serves only the GPUs. How can the CPU coolant be refilled?
I would be grateful for any advice, for corrections if I have misunderstood anything, or for any other information that may help solve this issue.
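For anyone wanting to check their own logs for the same message, here is a minimal Python sketch that counts over-temperature lines per CPU. The regular expression is an assumption based on the message quoted above (the exact wording can vary between kernel versions), and the function name `count_thermal_events` is made up for this example.

```python
import re
from collections import Counter

# Assumed message shape, e.g.
#   "... CPU3: Core temperature above threshold, cpu clock throttled ..."
# Adjust the pattern if your kernel words the message differently.
THERMAL_RE = re.compile(r"CPU\s*(\d+).*temperature above threshold", re.IGNORECASE)


def count_thermal_events(lines):
    """Return a Counter mapping CPU number -> count of over-temperature lines."""
    counts = Counter()
    for line in lines:
        match = THERMAL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Example use (the log path is an assumption; it differs between distributions):
#   with open("/var/log/kern.log", errors="replace") as fh:
#       print(count_thermal_events(fh))
```

If only a few cores show up, or the counts climb quickly even at idle, that points more toward a cooling problem on the CPU than toward a one-off load spike.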
Hi,
We are currently facing a similar issue with our DGX V100 system, which reaches CPU temperatures of around 100°C (according to lm-sensors) without any load on the system.
Changing thermal paste didn’t solve the issue.
Have you found a solution in the meantime, or any further ways to troubleshoot the issue?
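To log idle temperatures like the ones mentioned above without parsing the `sensors` output, one option is to read the kernel's hwmon sysfs files directly, which is where lm-sensors gets its values. This is only a sketch: the helper names `read_hwmon_temps` and `hottest` are made up for this example, and which chips and labels appear depends on the drivers loaded on your system.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(hwmon_dir, chip_name, label): degrees_celsius}.

    Values in temp*_input files are reported in millidegrees Celsius.
    """
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name  # fall back to the file name
            try:
                millideg = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor exists but could not be read; skip it
            temps[(chip_dir.name, chip, label)] = millideg / 1000.0
    return temps


def hottest(temps):
    """Return the (key, temperature) pair with the highest reading, or None."""
    return max(temps.items(), key=lambda kv: kv[1]) if temps else None
```

Calling `hottest(read_hwmon_temps())` every few seconds while the machine is idle, and again under load, gives a simple before/after comparison when swapping the cooler.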
Hello @k50112113,
We have the same problem with a “DGX Station v100”, and to me it looks like the CPU cooler is a closed-loop water cooler from Corsair, with no way to refill or change the coolant. How did you refill or change the coolant in this system?
Hello @schreihs,
Have you found a solution to your problem?
I have seen some tutorials on refilling the Corsair cooler, but you have to open up the closed coolant loop and refill it carefully (without letting any air in), which seems difficult.
So we resolved this issue by simply replacing the entire Corsair cooler. This is the one we bought: CORSAIR - iCUE H60X RGB ELITE AIO Liquid CPU Cooler 120mm Radiator. It costs about $80 online. Please make sure the “processor socket” specified is compatible with your DGX Station.
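To check socket compatibility, it helps to know the exact CPU model first and then look up its socket on the vendor's specification page. A minimal sketch, assuming an x86 Linux system where `/proc/cpuinfo` contains `model name` lines (the function name `cpu_model_names` is made up for this example):

```python
def cpu_model_names(cpuinfo_text):
    """Return the distinct 'model name' values from /proc/cpuinfo text, in order."""
    models = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            name = value.strip()
            if name not in models:
                models.append(name)
    return models


# Example use on the machine itself:
#   with open("/proc/cpuinfo") as fh:
#       print(cpu_model_names(fh.read()))
```

The printed model name can then be compared against the socket list in the cooler's product description.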
Hello @k50112113,
Thanks for the quick reply and the clarification, that helps a lot. Just to be sure, you have the “DGX Station v100” (DGX DL WS 4V100/256GB 32G) as well, right?
Br
Yesterday we installed a regular air CPU cooler (be quiet! Pure Rock 2, no water cooling). It works perfectly: during training runs on all 4 GPU cards, the CPU stays under 40°C.
Best
I have the same problem with the CPU cooler needing to be replaced for an Nvidia DGX V100 Station, so I ordered a CORSAIR - iCUE H60X RGB ELITE AIO Liquid CPU Cooler 120mm Radiator today. It will be delivered next week.
Can anyone provide me with work instructions for removing the defective CPU cooler from the DGX V100 Station and installing the new CORSAIR - iCUE H60X RGB ELITE AIO Liquid CPU Cooler 120mm Radiator in the DGX V100 case and on the motherboard? I want to make sure that nothing is damaged during this maintenance work.
Thanks in advance for your assistance with this matter. I am looking forward to your timely reply.
Thanks for sending a reply to my inquiry on a work procedure. The replacement CPU cooler will be delivered next week, so I will let you know if I encounter any problems. Hopefully, this restores the DGX to operational condition.
The Corsair iCUE H60X RGB ELITE Liquid CPU Cooler arrived yesterday and was installed without any problems. The system is able to boot now, but it needs to be rebuilt. Therefore, do you know where I can obtain the following items for the DGX V100 Station:
It appears that the system deleted part of my message. The OS on the DGX V100 needs to be rebuilt, but the owner misplaced the OEM media. Therefore, do you know where I can obtain copies of the following two items:
USB recovery flash drive containing a backup copy of the operating system image and CUDA toolkit
DVD-ROM containing source code of open-source software installed on the DGX Station