We previously installed dual T4 cards in an R450 server. Those cards run well and idle at 30C.
I installed single L4 cards in a few more R450 servers, but before I had configured any drivers or used any of them, I noticed these cards were too hot to handle comfortably. (I had to go back in to remove the plastic tape from one.) nvidia-smi reports these cards at around 70C at idle. Our data scientist finds the L4 benchmarks worse than the T4, and we suspect the card is thermally throttling.
The pattern we see, compared with AWS or our own T4s, is that the clocks run high at idle (2000+ MHz versus a few hundred), so it’s as if they are locked into some kind of non-scaling mode.
L4 clocks at idle:
    Clocks
        Graphics : 2040 MHz
        SM       : 2040 MHz
        Memory   : 6250 MHz
        Video    : 1770 MHz
    Applications Clocks
        Graphics : 2040 MHz
        Memory   : 6251 MHz
    Default Applications Clocks
        Graphics : 2040 MHz
        Memory   : 6251 MHz
T4 clocks at idle:
    Clocks
        Graphics : 300 MHz
        SM       : 300 MHz
        Memory   : 405 MHz
        Video    : 540 MHz
    Applications Clocks
        Graphics : 585 MHz
        Memory   : 5001 MHz
    Default Applications Clocks
        Graphics : 585 MHz
        Memory   : 5001 MHz
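To put a number on the difference between the two listings, here is a rough Python sketch that parses the clock sections and compares the current graphics clock with the applications clock. The helper names `parse_clocks` and `idle_ratio` are made up for illustration and are not part of any NVIDIA tool. It assumes the text has the "section header, then `Name : value MHz`" layout shown above.

```python
import re

# The two idle listings from this post.
L4_IDLE = """
Clocks
Graphics : 2040 MHz
SM : 2040 MHz
Memory : 6250 MHz
Video : 1770 MHz
Applications Clocks
Graphics : 2040 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 2040 MHz
Memory : 6251 MHz
"""

T4_IDLE = """
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
"""

# Matches lines such as "Graphics : 2040 MHz" (any amount of padding).
CLOCK_LINE = re.compile(r"^(.+?)\s*:\s*(\d+)\s*MHz$")


def parse_clocks(text):
    """Return {section: {clock_name: mhz}} from a clock listing."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = CLOCK_LINE.match(line)
        if match:
            if current is not None:
                sections[current][match.group(1).strip()] = int(match.group(2))
        else:
            # Any line that is not a "name : N MHz" entry starts a new section.
            current = line
            sections[current] = {}
    return sections


def idle_ratio(sections, domain="Graphics"):
    """Current clock divided by the applications clock for one domain."""
    return sections["Clocks"][domain] / sections["Applications Clocks"][domain]


if __name__ == "__main__":
    for name, text in (("L4", L4_IDLE), ("T4", T4_IDLE)):
        print(f"{name}: idle graphics clock is "
              f"{idle_ratio(parse_clocks(text)):.2f}x the applications clock")
```

On the listings above, the L4 comes out at 1.00 and the T4 at 0.51, which is the "not scaling down at idle" pattern in a single figure.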
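We see performance state P8 on our T4s at idle, but the L4s alternate between P8 and P0. P0 is nominally the maximum-performance state, so we cannot tell whether the card is being held at full clocks or is in some “thermally throttled” condition above 70C.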
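One way to test the throttling guess rather than infer it from the performance state is to ask nvidia-smi for the clock throttle reasons alongside temperature and P-state. The sketch below is only a starting point. It assumes the `clocks_throttle_reasons.*` query fields are available in the installed driver (newer drivers may list them under `clocks_event_reasons.*`; `nvidia-smi --help-query-gpu` shows the exact names), and the helpers `parse_rows`, `thermal_flags` and `query_gpus` are hypothetical names of my own.

```python
import csv
import io
import shutil
import subprocess

# Fields passed to --query-gpu; assumed to be supported by the installed driver.
FIELDS = [
    "index",
    "name",
    "temperature.gpu",
    "pstate",
    "clocks.sm",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_slowdown",
]


def parse_rows(csv_text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def thermal_flags(row):
    """Names of the thermal throttle reasons reported as Active."""
    return [f for f in FIELDS if "thermal" in f and row[f] == "Active"]


def query_gpus():
    """Run nvidia-smi once; return [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_rows(result.stdout)


if __name__ == "__main__":
    for row in query_gpus():
        flags = thermal_flags(row) or ["no thermal throttle reason active"]
        print(f"GPU {row['index']} ({row['name']}): {row['temperature.gpu']}C, "
              f"{row['pstate']}, SM {row['clocks.sm']} MHz, {', '.join(flags)}")
```

Running this in a loop while the card sits idle and again under the benchmark would show whether a thermal slowdown flag is ever set, or whether the card is simply sitting in P0 at full clocks.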
I generated an nvidia-bug-report.log.gz but do not see how to post it. Is it possible to obtain technical support from NVIDIA?