While running our AI inference tasks, the power usage of the Tesla T4 is usually around or below 70W. But we observed that it can sometimes soar to 90W or even 100W (although this lasts only for a short period of time), as shown in the attached image:
I can understand that peak power usage could be 5% or 10% above the 70W maximum, i.e. 73.5W or 77W, but 100W seems too much. Could this be a hardware / driver issue? Or is the output of nvidia-smi inaccurate?
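One way to see how long those spikes actually last is to sample the power draw at a higher rate than the default nvidia-smi refresh. A sketch (the log file name and interval are just examples; note that power.draw is itself an averaged reading, so very short transients may be smoothed):

```shell
# Sample power draw roughly every 100 ms and log it:
#   nvidia-smi --query-gpu=timestamp,power.draw --format=csv,noheader -lms 100 > power.log

# Then find the maximum reading in the captured log.
# The here-doc below stands in for power.log contents (values are made up):
awk -F', ' '{ gsub(/ W/, "", $2); if ($2+0 > max) max = $2+0 } END { print max " W" }' <<'EOF'
2024/01/01 12:00:00.000, 68.21 W
2024/01/01 12:00:00.100, 99.87 W
2024/01/01 12:00:00.200, 71.03 W
EOF
# prints: 99.87 W
```

If the 100W readings show up consistently across many samples rather than as single outliers, that argues against a one-off measurement glitch.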
In my understanding, that shouldn't happen, since the PCIe slot has a power budget of only 75W. It seems you're running Windows, so I don't really know what to look at there.
Don't really know if it's worth the hassle to install Linux.
You could first check whether there's a Windows equivalent to 'lspci', and whether the underlying PCI bridge the T4 is connected to properly sets the PowerLimit.
lspci -vv
prints the PowerLimit, and
lspci -t
shows which bus the T4 is connected to.
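Putting that together, something like the following should work on the Linux side (the bridge address and the exact capability text vary by platform; the here-doc stands in for real `lspci -vv` output from the T4's upstream port):

```shell
# Locate the T4 first, e.g.:
#   lspci | grep -i nvidia      # gives the GPU's bus address
#   lspci -t                    # shows which bridge/root port it hangs off
# Then dump that port's capabilities and pull out the slot power limit:
#   sudo lspci -vv -s <bridge-address> | grep -oE 'PowerLimit [0-9.]+W'

# Sample output fragment in place of a live system (values illustrative):
grep -oE 'PowerLimit [0-9.]+W' <<'EOF'
        Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 75.000W; Interlock- NoCompl+
EOF
# prints: PowerLimit 75.000W
```

The PowerLimit in SltCap is what the slot advertises to the card, so if that reads something other than 75W it would be worth investigating.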
On the Linux server, is the T4 also jumping above the power limit? Is Xorg stopped and nvidia-persistenced started?
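Both of those, along with the limit the driver actually enforces, are visible in `nvidia-smi -q -d POWER`. A quick way to check (the here-doc mimics a fragment of that output; field names are as in recent drivers, values are made up for a T4):

```shell
# On a live system:
#   nvidia-smi -q -d POWER | grep -E 'Persistence Mode|Power Limit'

# Sample output fragment in place of a live system:
grep -E 'Persistence Mode|Power Limit' <<'EOF'
    Persistence Mode              : Enabled
    Power Readings
        Power Draw                : 67.45 W
        Power Limit               : 70.00 W
        Enforced Power Limit      : 70.00 W
EOF
# prints the Persistence Mode line plus the two power-limit lines
```

If persistence mode is off, `sudo nvidia-smi -pm 1` (or running nvidia-persistenced) enables it, which keeps the driver loaded and power management state consistent between jobs.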
Since the PCIe specs are not openly available, it's impossible to say whether this is a spec violation or an allowed transient overcurrent.