Hey all, I realize that this is not a hardware forum, but I’m not aware of any Nvidia hardware forums, so I’m posting here in hopes that someone has some information. I’ve been trying to run some deep learning models on a GTX 1070 in the MXM form factor (Aetina M3N1070-NN) with its accompanying PCIe-to-MXM conversion board.
The problem that we are having is huge instantaneous power consumption when we run our deep learning models. The user manual states that the MXM GPU is 115W, but we see peak power consumption of 240W and an average consumption of around 180W. We’re running this in a reasonably constrained environment (hence the MXM board) so we’d like to see if this is expected behavior and if anyone has any experience with the real power consumption on Nvidia GPUs.
- We are using a 12V benchtop power supply for the GPU auxiliary power that is capable of 50A
- We are running inference using TensorFlow's SSD model on 640x640 images
- nvidia-smi reports GPU power usage of 80W during the high usage time. I'm guessing I can't trust this because the high power draw we're seeing is somewhat instantaneous (20ms)
Does anyone know what kind of power consumption I should expect? Even experiences from normal desktop cards would be helpful.
Any information is appreciated.
Is this a number reported by nvidia-smi, or how did you measure this? Reliably determining power draw of a GPU in isolation by external means is a non-trivial undertaking in my experience. Double check that you are measuring the actual power draw by the GPU alone.
What is the “user manual” you are referencing? On the internet I find power specifications for the MXM version of the GTX 1070 listed anywhere from 115W to 125W.
My general experience with NVIDIA GPUs (excluding MXM types, which I have never used) is that NVIDIA power management guarantees with a high degree of reliability that longer-term power consumption does not exceed the NVIDIA specified maximum power when running a GPU at the NVIDIA-specified clock rates (so non-overclocked). But short-term power spikes in the tens of milliseconds are possible before the power management is able to curb power draw, with spikes typically not exceeding 125% of nominal maximum.
I think nvidia-smi returns something like a one-second moving average for the power number. Also, on-GPU sensors are typically not super accurate. I would assume tolerances of +/-5%, unless NVIDIA specifically specifies tighter tolerances.
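To illustrate why an averaged reading can sit far below a short spike, here is a minimal Python sketch. It assumes a one-second averaging window (my guess above, not a documented nvidia-smi behavior) and borrows the numbers quoted earlier in the thread: an 80W baseline with a single 20ms excursion to 240W. The helper names `build_trace` and `windowed_average` are made up for this example.

```python
# Hypothetical power trace, one sample per millisecond.
# The 1000 ms window is an assumption, not a documented nvidia-smi property.

def build_trace(baseline_w, spike_w, spike_ms, window_ms=1000):
    """Return a list of per-millisecond power samples with one spike at the start."""
    return [spike_w if t < spike_ms else baseline_w for t in range(window_ms)]

def windowed_average(samples):
    """Plain arithmetic mean over the whole window."""
    return sum(samples) / len(samples)

trace = build_trace(baseline_w=80.0, spike_w=240.0, spike_ms=20)
avg = windowed_average(trace)
peak = max(trace)

print(f"average over window: {avg:.1f} W, peak: {peak:.1f} W")
```

With these illustrative numbers the window average moves by only a few watts while the peak is three times the baseline, so an averaged sensor reading and an oscilloscope-style measurement can both be correct and still disagree.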
Since you refer to the average power consumption as being 180W, I assume that is the long-term rate you observe. Given that the desktop version of the GTX 1070 is specified with a maximum power of 150W, something does not add up here. Have you double-checked the supply voltage? I don’t know what the GTX 1070 uses, but would expect maximum voltage <= 0.96V or so. If the voltage checks out, I would suggest you discuss your findings with the GPU vendor, Aetina.
Hey njuffa, thanks for your reply.
I did some more measuring of the power consumption and the calculation of the average power consumption was incorrect. It is actually ~125W on our “heavier” models which is pretty close to the spec of 115W from Aetina. We’re still seeing the huge power spikes of >300W, but we plan to just put a more powerful supply on it as all the other functionality is working.
Once again thanks for your help!
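For sizing the supply, the figures above can be turned into currents with I = P / V. This is only a rough sketch: it assumes the whole draw goes through the 12V auxiliary input, which may not hold when power is split across two connectors, and it says nothing about how quickly a given supply responds to a 10-20ms transient. The function name `current_amps` is hypothetical.

```python
# Rough current estimate from power, assuming a single 12 V rail carries everything.

def current_amps(power_w, rail_v=12.0):
    """Ideal DC current for a given power draw on one rail (I = P / V)."""
    return power_w / rail_v

# Power figures taken from the posts above.
for label, power_w in [("average", 125.0), ("spike", 300.0)]:
    print(f"{label}: {power_w:.0f} W -> {current_amps(power_w):.1f} A at 12 V")
```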
On what kind of time scale are you observing a power draw of 300W? Microseconds? Milliseconds?
Frankly, I find it hard to believe you are seeing 300W spikes on a part designed for a maximum draw of ~120W. As I said, I am not familiar with MXM modules. But I am wondering whether (a) your power supply is operating in accordance with the module requirements for power supply, and (b) you are using a suitable measuring methodology.
Does the module have an auxiliary power connector? Desktop GPUs require an auxiliary power connector if they draw more than 75W, as that is the power limit of the PCIe connector.
The spikes are on the scale of 10 milliseconds. The unit is rated at 120W but that appears to be the average consumption while things are running, not the maximum draw.
There are two auxiliary power connectors, one for the dev board that converts PCIe to MXM and one that connects directly to the MXM module. The conversion board is weird because the 12V PCIe lines are disconnected, but that makes it possible for us to supply the auxiliary power with a different voltage level.
The MXM 1070 has an extra power port because it draws more power than the MXM connector itself can supply.
Measuring power draw accurately is more difficult when there are multiple power connectors. 10ms power spikes to 300W are contrary to what I would expect; I also have never seen any reports of such massive spikes in connection with GPUs.
As I stated, spikes of about 25% over the maximum specified power are what I would expect to see at that time scale. I wonder whether there could be an issue with a current limiter circuit or somesuch. But I am not an EE and not a specialist in power supplies; my knowledge derives from working alongside people with that skill set and from rudimentary experiments of my own.
If you are not a power supply specialist yourself, you might want to run this issue by someone who is, as well as the board vendor. You mentioned some sort of adapter is in play. Is that purely passive or does it have active components as well?
[Later:] I trawled the net for reports of unusually large power spikes with GTX 1070 MXM and found nothing. One user reported spikes reaching 140W, which is roughly in line with my expectations.
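As a quick sanity check of that rule of thumb, the sketch below applies the 25% figure (my own rough expectation, not an NVIDIA specification) to the 115W and 125W ratings mentioned earlier, and compares the reported 300W spike against the 115W rating. The name `expected_spike_ceiling` is made up for this example.

```python
# Rule-of-thumb check: short spikes of about 125% of nominal maximum power.
# The 1.25 factor is an informal expectation, not a vendor-specified limit.

SPIKE_FACTOR = 1.25

def expected_spike_ceiling(nominal_w, factor=SPIKE_FACTOR):
    """Rough upper bound on short-term spikes for a given nominal rating."""
    return nominal_w * factor

for nominal_w in (115.0, 125.0):
    ceiling_w = expected_spike_ceiling(nominal_w)
    print(f"nominal {nominal_w:.0f} W -> expected spike ceiling {ceiling_w:.2f} W")

# Ratio of the reported spike to the 115 W rating.
reported_ratio = 300.0 / 115.0
print(f"reported 300 W spike is {reported_ratio:.2f}x the 115 W rating")
```

A 140W spike falls under both ceilings, whereas 300W is well over twice the rating, which is why the measurement setup and the power path deserve a second look.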