PSU3 and PSU4 on our DGX-1 failed simultaneously, as did both Ethernet ports.
However, the management port still works, and all sensor readings are green except for PSU3 and PSU4 (0 Watts).
We verified that our power outlets are working, reseated the power plugs in the DGX-1, and swapped the power supplies in PSU bays 1 and 2 with the power supplies in bays 3 and 4. Result: the power supplies that did not work in bays 3 and 4 did work in bays 1 and 2, and the power supplies that formerly worked in bays 1 and 2 do NOT work in bays 3 and 4. Conclusion: all 4 power supply units are actually OK and the problem lies elsewhere.
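The reasoning behind that swap test can be written down as a small sketch. The unit labels A to D and the exact pairing of units to bays are assumptions made for illustration (the post only says the two pairs were swapped); the function just asks whether a failure stays with the bay or follows the unit.

```python
from collections import defaultdict


def isolate_fault(observations):
    """observations: iterable of (unit, bay, worked) tuples from a swap test.

    A bay is suspect if every unit tried in it failed, and at least one of
    those units worked in some other bay. A unit is suspect if it failed in
    every bay it was tried in, and at least one of those bays worked with
    another unit.
    """
    by_bay = defaultdict(list)
    by_unit = defaultdict(list)
    for unit, bay, worked in observations:
        by_bay[bay].append((unit, worked))
        by_unit[unit].append((bay, worked))

    units_ok_somewhere = {u for u, res in by_unit.items() if any(w for _, w in res)}
    bays_ok_sometime = {b for b, res in by_bay.items() if any(w for _, w in res)}

    suspect_bays = sorted(
        b for b, res in by_bay.items()
        if not any(w for _, w in res)
        and any(u in units_ok_somewhere for u, _ in res)
    )
    suspect_units = sorted(
        u for u, res in by_unit.items()
        if not any(w for _, w in res)
        and any(b in bays_ok_sometime for b, _ in res)
    )
    return suspect_bays, suspect_units


# Hypothetical labels: A and B started in bays 1 and 2, C and D in bays 3 and 4.
observations = [
    ("A", 1, True), ("B", 2, True), ("C", 3, False), ("D", 4, False),  # before swap
    ("C", 1, True), ("D", 2, True), ("A", 3, False), ("B", 4, False),  # after swap
]

if __name__ == "__main__":
    print(isolate_fault(observations))
```

With these observations the failure stays with bays 3 and 4 and no individual unit is implicated, which matches the conclusion above.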
We lost the network link to the DGX-1 at the same moment that the 2 PSUs failed, and there are no lights on the Ethernet ports. (We do see lights and have connectivity on the management Ethernet port.)
I'm a little surprised the machine is running at all, as my understanding is that it is a “3 + 1” PSU setup, meaning only 1 redundant PSU.
Any suggestions on how to revive the network and/or power supplies?
You’re correct that the DGX-1 is 3+1 redundant, but it still tries its best to start with more than one failure. Very weird that the 10GbE ports died at the same time, as those are on a mezzanine card that doesn’t really touch the PSUs.
This sounds like a perfect thing for NVIDIA Enterprise Support to solve @user158525 . :-)
Seems like you need at least one more PSU to get back to a normal power state. Since it’s out of support and end-of-sale, we don’t have any PSUs to sell. The PSU should be a Quanta 1HY9ZZZ076Y, but I’m not sure those are available from anywhere. There’s always buying a used DGX-1 from somewhere, and using the PSUs and Ethernet mezzanine card from it. :-)
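For what "a normal power state" means in a 3+1 setup, here is a rough sketch of the arithmetic. The per-PSU wattage and the load figures are placeholder assumptions for illustration only and should be checked against the DGX-1 user guide; only the "3 required + 1 redundant" layout comes from this thread.

```python
def psu_status(working_psus, psu_watts, system_load_watts, required_psus=3):
    """Summarise the power state of an N+1 PSU arrangement.

    required_psus is the "N" in N+1 (3 for the 3+1 layout discussed here).
    psu_watts and system_load_watts are caller-supplied assumptions.
    """
    capacity = working_psus * psu_watts
    return {
        "capacity_watts": capacity,
        "redundant": working_psus > required_psus,      # a spare PSU is available
        "meets_n": working_psus >= required_psus,       # enough PSUs for rated load
        "covers_load": capacity >= system_load_watts,   # enough for this specific load
    }


if __name__ == "__main__":
    ASSUMED_PSU_WATTS = 1600    # assumption, not confirmed in this thread
    ASSUMED_PEAK_LOAD = 3500    # assumption, not confirmed in this thread
    ASSUMED_IDLE_LOAD = 800     # assumption, purely illustrative
    print(psu_status(2, ASSUMED_PSU_WATTS, ASSUMED_PEAK_LOAD))
    print(psu_status(2, ASSUMED_PSU_WATTS, ASSUMED_IDLE_LOAD))
```

Under those assumed numbers, two working PSUs cover a light load but not the peak load, which would be consistent with the machine still running while not being in a normal power state.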
Scott, thanks for your responses. But since the two seemingly dead PSUs started working again when moved to the PSU “sockets” formerly used by the two working PSUs, I’m skeptical that the issue would be resolved by buying replacement PSUs.
What might cause both the PSU3 and PSU4 “sockets” to fail at the same time? Is there some way to remotely test the functionality of the mezzanine card, e.g. by attempting a firmware update?
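One low-risk check before trying a firmware update is to see whether the operating system still lists the 10GbE interfaces at all. The sketch below assumes a Linux host and a session that does not depend on the dead ports (for example a console reached through the management port); it only reads the standard sysfs files under /sys/class/net, and the interface names it prints will depend on the system.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        # e.g. the file is missing, or "carrier" is unreadable because the
        # interface is administratively down
        return None


def link_states(sys_net="/sys/class/net"):
    """Return {interface: (operstate, carrier)} as reported by the kernel."""
    base = Path(sys_net)
    if not base.is_dir():
        return {}
    states = {}
    for iface in sorted(base.iterdir()):
        states[iface.name] = (_read(iface / "operstate"), _read(iface / "carrier"))
    return states


if __name__ == "__main__":
    for name, (operstate, carrier) in link_states().items():
        print(f"{name}: operstate={operstate} carrier={carrier}")
```

If the 10GbE interfaces are missing from the list entirely, that would suggest the card is not being detected; if they are listed with no carrier, the card is visible to the OS but has no link. Either result is only a hint, not a diagnosis.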
Aha! All 4 PSUs are connected to a single PDB (Power Distribution Board) in the system. Maybe your failure is in the PDB itself (for example, in the slots where PSU 3 and 4 sit)? The answer in that case would be to replace the PDB. This is something the NVIDIA field support teams would do for systems under support, so it’s non-trivial, but possible.
It begs the question of “Where can I buy a PDB?”…unfortunately I think the answer would end up as “find a cheap for-parts DGX and get it from there”. :-(
Oh, good find on the user guide! I totally forgot we had a picture of the underside of the card there!
Are you actively using all of the 4x NICs on the rear of the system? Those are dual-personality, and can be either InfiniBand or Ethernet, so depending on your setup you could use one of them to connect to your network. In theory USB-Ethernet would work, but performance wouldn’t be ideal - I’m sure fine for just SSH’ing in, but you wouldn’t want to access storage or other high-speed stuff that way.
Those connectors are QSFP28, but you can turn them into SFP+ with something like an MAM1Q00A-QSA28 adapter.
I don’t know the PDB part number - we’ve only offered that as an NVIDIA RMA-able part (with an NVIDIA part number). I’m not sure if Quanta ever sold the PDB themselves as a standalone component. I reconfirmed with our support team that we don’t have the ability to sell DGX-1 components directly (only use them as part of an existing support contract).
Sorry. :-( The DGX-1 is still a really cool and useful system, and I feel bad that yours is out of support and having issues!