While running our AI inference tasks, the power usage of the Tesla T4 is usually around or below 70W. But we observed that it can sometimes soar to 90W or even 100W (although this lasts only for a short period of time), as in the attached image:
I can understand that the peak power usage could be 5% or 10% more than the 70W maximum, i.e. 73.5W or 77W. But 100W is perhaps too much. Could this be a hardware / driver issue? Or was the output of
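For reference, a minimal sketch of how such spikes could be caught from the command line. The CSV lines below are hypothetical sample readings standing in for live `nvidia-smi` output; the command in the comment is the real query form:

```shell
# Sketch: sample the reported draw once per second and flag readings above
# the 75W PCIe slot budget. The CSV lines below are hypothetical samples
# standing in for live output of:
#   nvidia-smi --query-gpu=timestamp,power.draw --format=csv,noheader -l 1
samples='2024/01/01 12:00:00.000, 68.51 W
2024/01/01 12:00:01.000, 71.20 W
2024/01/01 12:00:02.000, 99.87 W'

echo "$samples" | awk -F', ' '{
    w = $2; sub(/ W$/, "", w)           # strip the unit
    if (w + 0 > 75) print "SPIKE:", $0  # above the 75W slot budget
}'
```

Logging this way over a longer run would show whether the excursions are isolated millisecond-scale transients or sustained draws.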
In my understanding, that shouldn't happen, since the PCIe slot has a power budget of only 75W. It seems you're running Windows, so I don't really know what to look at there.
Yes, I'm using Windows, but I can switch to Linux, too, if needed.
I don't really know if that is worth the hassle of installing Linux, though.
You could first check whether there's a Windows equivalent to 'lspci', and verify that the underlying PCI bridge the T4 is connected to properly sets the PowerLimit.
Thanks for the answer. We do have one Tesla T4 installed on a server, which is running Ubuntu…
However, I didn't see any attribute similar to PowerLimit in the output of the lspci command.
'lspci -vv' does output the PowerLimit.
'nvidia-smi' tells which bus the T4 is connected to.
Is the T4 also jumping over the power limit on the Linux server? Is Xorg stopped and nvidia-persistenced started?
e.g. my GPU at 0000:01:00.0 is on bus 0000:00:01.0:
lspci -vv -s 0000:00:01.0
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #1, PowerLimit 75.000W; Interlock- NoCompl+
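As a sketch of pulling the value out programmatically, the here-string below reproduces the sample SltCap lines above; on a live system you would pipe from the real `lspci -vv -s <bridge>` command instead:

```shell
# Sketch: extract the slot PowerLimit from an 'lspci -vv' capability dump.
# The here-string reproduces the sample output above; on a live system use:
#   sudo lspci -vv -s 0000:00:01.0
sltcap='SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #1, PowerLimit 75.000W; Interlock- NoCompl+'

# Print only the PowerLimit value, e.g. "75.000W"
echo "$sltcap" | sed -n 's/.*PowerLimit \([0-9.]*W\).*/\1/p'
```

A 75.000W limit here means the bridge advertises the standard slot budget, so sustained readings above it would indeed be out of spec for slot-only power.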
Hi there, we have the same issue on Linux systems. Is there any solution or explanation?
Since the PCIe specs are not openly available, it's impossible to say whether this is a spec violation or an allowed overcurrent for a short period of time.
Where are the metrics extracted from when I run the nvidia-smi utility?
Are those metrics read from PCIe registers or from the Tesla T4 card itself?