I’m having problems with an NVIDIA DGX A100. The server randomly shuts down without any error messages. After the first such incident, I was able to start the machine again by completely disconnecting and reconnecting its power.
Yesterday the server went offline again, but this time I can’t turn it back on. On all power supplies, all upper LEDs are solid green and all lower LEDs are blinking orange. Also, the codes “pd” and “4b” are displayed alternately on the segment display. There are no error messages in the logs. After a power reset, I lost access to the BMC through both the web interface and IPMI.
Has anyone faced such a problem?
@cepra.x you should open a case with Enterprise Support at https://nvid.nvidia.com and let them know about the “pd/4b” code. They will help you resolve the issue.
Thanks @ilb. I’ll submit a request to Enterprise Support.
Does anyone know if there is a description of the codes on the segment display somewhere? I think that such information would be useful not only to me.
FWIW, I had two cases: one with half 7s (only the top portion of the LED), which ended up being a GPU tray failure, and one with Pd/4A plus alternating Pu/4A and Pu/4B, which ended up being a motherboard (MB) tray failure.
In both cases Enterprise Support was super helpful. The replacement part was shipped to my location and a local CE came on-site to do the actual replacement. All I had to do was ship back the faulty part (prepaid). The issues were resolved very fast, i.e. the affected system was down for just a couple of days.
Thanks a lot, @ilb. I hope Enterprise Support will quickly solve the problem with my system as well.