Is there any tool to monitor/measure the power consumption of a DGX box? nvidia-smi only reports the power consumption of the GPUs, not of the entire node (e.g., CPUs, memory chips).
I didn’t find an appropriate section for this question. Please move it to the right section if needed. Thanks.
Hi @llodds !
You can get the total system power in-band or out-of-band via IPMI. For example:
dgx05:~$ sudo ipmitool sensor get PWR_SYSTEM
Locating sensor record...
Sensor ID : PWR_SYSTEM (0x6e)
Entity ID : 7.0
Sensor Type (Threshold) : Chassis
Sensor Reading : 1560 (+/- 0) Watts
Status : ok
Lower Non-Recoverable : na
Lower Critical : na
Lower Non-Critical : na
Upper Non-Critical : 19890.000
Upper Critical : 19890.000
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Assertions Enabled : unc+ ucr+
Deassertions Enabled : unc+ ucr+
This shows that the current total system power is 1560W (this is an idle system, obviously :-) ). If you wanted individual components,
ipmitool sensor will give you all known sensors, with power ones all having the
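To log the reading over time rather than read it by eye, the output above can be parsed with a small script. The following is only a minimal sketch: it assumes the sensor is named PWR_SYSTEM as on the system above (other platforms may use a different name), that ipmitool is installed, and that the caller has sudo rights. The function names parse_sensor_reading and read_system_power are made up for illustration.

```python
import re
import subprocess


def parse_sensor_reading(text):
    """Return the numeric 'Sensor Reading' value from `ipmitool sensor get`
    output as a float, or None if no numeric reading is present (e.g. 'na')."""
    match = re.search(r"Sensor Reading\s*:\s*([0-9.]+)", text)
    return float(match.group(1)) if match else None


def read_system_power(sensor="PWR_SYSTEM"):
    """Run ipmitool for one sensor and return its reading in watts.

    Assumption: the sensor name matches the one shown in the post above.
    """
    result = subprocess.run(
        ["sudo", "ipmitool", "sensor", "get", sensor],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sensor_reading(result.stdout)


if __name__ == "__main__":
    print(read_system_power(), "W")
```

Calling read_system_power() in a loop with a sleep between calls would give a simple power trace for the duration of a job.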
Hi @ScottEllis, thanks for the reply. I don’t have sudo access; it’s a DGX clone on Azure. Is there a tool that I can use without sudo access? Thanks.
I am not aware of a way to do that. Even tools like powertop require access to the kernel and to sysctls that aren’t normally accessible to an unprivileged user.
If it’s a cloud instance, I’m curious why you care about power usage. Is this to see how close to “optimal” your code is?
OK, thanks for the information. I will ask the sysadmins who have root access whether they can get the power consumption. Eventually we want to buy some DGXs in-house; the cloud DGX is merely a test box. We want exact numbers to compare with those of our in-house CPU boxes. Yes, we would like to know the actual power consumption as a percentage of the reported peak power of the DGX machine. We would also like to see whether we can lower the CPU/GPU frequency to reduce power consumption without hurting the application’s performance much (the application is memory-bandwidth bound). Always go greener!
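The percentage mentioned above is simple arithmetic once a measured value and a peak figure are available. A minimal sketch, where percent_of_peak is a made-up helper and the 3000 W peak is a placeholder rather than a real DGX specification (substitute the figure from the datasheet of the actual model):

```python
def percent_of_peak(measured_watts, peak_watts):
    """Return measured power as a percentage of the rated peak power."""
    if peak_watts <= 0:
        raise ValueError("peak_watts must be positive")
    return 100.0 * measured_watts / peak_watts


if __name__ == "__main__":
    measured = 1560.0  # idle reading from the ipmitool output earlier in the thread
    peak = 3000.0      # hypothetical placeholder, NOT a published DGX figure
    print(f"{percent_of_peak(measured, peak):.1f}% of peak")
```

Repeating the same calculation at each CPU/GPU frequency setting, alongside the application's runtime, would show how much power is saved per unit of performance lost.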