My DGX seems to have powered off on its own and when I attempt to turn it back on, it is unresponsive. After checking the event log in the BMC, here are the events that occurred near the time it went down:
From the log, you can see that from August 27 there were not any reported events until the power down occurred on November 9. After that, the log shows that I attempted to turn it on without any success.
Currently all power supplies seem OK (green lights are flashing on each indicating they are on standby) so I don’t think it is a power issue. Here is the current status of the power supplies from the BMC:
Also, when I attempt to turn the DGX on using server power control within the BMC, the BMC makes two attempts to turn it on and then comes back with a message “Performing power action failed.”
Apart from the event log, I cannot find any information that can help me to understand why it powered off. (No other servers in the building powered off at that time so I don’t think it was a building power issue). Could somebody point me to other possible logs that might help me to better diagnose the issue?
Also, are there any logs that are created when the BMC fails to turn on the server? These might also help me to diagnose the issue.
Any help will be greatly appreciated.
Hi @joseph29 ,
That’s indeed peculiar! I assume you get the same behavior if you try and power it on with the physical button?
I don’t believe there are any logs captured beyond the SEL (aka, what you see currently, which should roughly match the output of ipmitool sel elist).
Have you tried physically removing power from all the power supplies (pull the plugs, or turn off the PDU ports if you have that ability)? That can help get the whole system out of a funky state. If that’s not possible, you may want to try rebooting the BMC (ipmitool mc reset cold), although I don’t suspect that’ll change anything.
This really sounds like it’d be best handled by our Enterprise Support team though - my main recommendation is contacting them. See “About the DGX User Forum / Note: this is not NVIDIA Enterprise Support” (post #4) for info on how to do that.
Thank you very much for the reply, @ScottEllis .
Yes, I have also attempted to power it on by pressing the physical button, but the server remains unresponsive.
Thank you for the suggestion to unplug the server completely. I will attempt that when I am able to physically access the server tomorrow.
I appreciate you mentioning the equivalent commands using the ipmitool. Might you be able to tell me how to access this command-line tool? From the manual, it seems that it is only available within the OS on the DGX-1 itself. I did try to ssh to the BMC but I get a SMASH console and the ipmitool does not appear to exist within this environment.
If I am not able to reboot the machine after attempting your suggestions, I plan to contact Enterprise support.
Thank you again,
The IPMI commands can be executed with ipmitool locally (as you see in the manual, but which obviously doesn’t work for you!), or remotely. To use them remotely, you specify the username, password, hostname, and “interface” (not what you’d think). For example:
ipmitool -U mybmcusername -P mybmcpassword -H 18.104.22.168 -I lanplus mc info
If you run that on some other system with network connectivity to the DGX-1 BMC, and substitute your credentials for mybmcusername and mybmcpassword and the IP or hostname of the BMC for 18.104.22.168, then it should run the mc info command. If that works, then mc reset cold would be the next step.
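For repeated use, the remote invocation above could be wrapped in a small shell function. This is only a sketch: the function name bmc and the variables BMC_HOST, BMC_USER, BMC_PASS, and DRY_RUN are made up for illustration and are not part of ipmitool. The subcommands shown in the comments are standard ipmitool ones.

```shell
# Hypothetical wrapper around the remote ipmitool call.
# BMC_HOST, BMC_USER, BMC_PASS and DRY_RUN are placeholder names chosen
# for this sketch; set the first three to match your own BMC.
bmc() {
    # When DRY_RUN is set to a non-empty value, print the command
    # instead of running it.
    ${DRY_RUN:+echo} ipmitool -I lanplus \
        -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" "$@"
}

# Example calls (commented out so nothing touches a BMC by accident):
#   bmc mc info                 # confirm the BMC answers
#   bmc sel elist               # dump the System Event Log
#   bmc chassis power status    # is the host on or off?
#   bmc mc reset cold           # reboot the BMC itself
```

Note that passing -P on the command line leaves the password visible in the process list and shell history. ipmitool also accepts -E, which reads the password from the IPMI_PASSWORD environment variable, if that matters in your environment.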
Thank you for the information. I was able to use ipmitool remotely with the commands you mentioned. Unfortunately, resetting the BMC and attempting another power on did not work, but this will help tremendously in the future for logging and checking the status of the server.
Thank you again,
Shucks! I’m confident Enterprise Support will help get things figured out though.
Hi @ScottEllis ,
I was able to unplug all four power supplies as you suggested and after plugging them in again, I was able to boot the server! Thank you very much for your help!
Yay! Not sure why it got into that weird state, but at least it’s out of that state now. :-)