Nvidia L4 Cards in Dell R450 Servers Run Hot Upon Installation

We have previously installed dual T4 cards in an R450 server. The cards ran nicely and idled at around 30C.

I installed single L4 cards in a few more R450 servers, but before I had configured any drivers or run anything on them, I noticed the cards were too hot to handle comfortably. (I had to go back in to remove the plastic tape from one.) nvidia-smi reports these cards at around 70C at idle. Our data scientist finds the L4 benchmarking worse than the T4, and we suspect the card is throttling due to thermal management.
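For anyone reproducing this, a quick way to watch temperature, power draw, and performance state side by side (this assumes nothing beyond the nvidia-smi tool that ships with the driver) is a looping query, for example:

> nvidia-smi --query-gpu=temperature.gpu,power.draw,pstate --format=csv -l 5

The -l 5 flag repeats the readout every five seconds; Ctrl-C stops it.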

The pattern we see, compared with AWS instances or with our own T4s, is that the clocks run high at idle (2000+ MHz versus a few hundred), as if the cards are locked into some kind of non-scaling mode.

L4 clocks at idle:

    Clocks
        Graphics                          : 2040 MHz
        SM                                : 2040 MHz
        Memory                            : 6250 MHz
        Video                             : 1770 MHz
    Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz

T4 clocks at idle:

    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Default Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
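For reference, the clock sections above come from nvidia-smi's device query; the same dump should be reproducible on either card with:

> nvidia-smi -q -d CLOCK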

We see performance state P8 on our T4, but the L4s alternate between P8 and P0, and the time spent in P0 seems to coincide with the card sitting above 70C.
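One way to check whether the driver is actually reporting thermal slowdown, rather than inferring it from the temperature alone, is the performance section of the device query. The section heading and field names vary a little between driver versions, but there are SW/HW thermal slowdown entries that flip to Active when the card throttles:

> nvidia-smi -q -d PERFORMANCE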

I generated a nvidia-bug-report.log.gz but do not see how to post it. Is it possible to obtain technical support from Nvidia?
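(For anyone following along, the archive comes from the bug-report script that ships with the driver, run as root; it writes nvidia-bug-report.log.gz into the current directory.)

> sudo nvidia-bug-report.sh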

The answer here may be to enable persistence mode. It seems that idle clock scaling does not kick in unless the Nvidia driver stays loaded:

> ps auxww | grep nvidia
root        1423  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        1424  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
> nvidia-smi -q | grep Persist
    Persistence Mode                      : Disabled
> sudo nvidia-smi -pm 1
Enabled Legacy persistence mode for GPU 00000000:17:00.0.
All done.
> ps auxww | grep nvidia
root        1423  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        1424  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        6271  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/204-nvidia]
root        6272  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/205-nvidia]
root        6273  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/206-nvidia]
root        6274  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/207-nvidia]
root        6275  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/208-nvidia]
root        6276  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/209-nvidia]
root        6277  0.0  0.0      0     0 ?        S    14:26   0:00 [nvidia]
> nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk 
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz 
    0     32     68      -      0      0      0      0      0      0   6250   2040 
    0     32     67      -      0      0      0      0      0      0    405    285 
    0     16     67      -      0      0      0      0      0      0    405    210 
    0     16     66      -      0      0      0      0      0      0    405    210 

At first the driver is not loaded (there is no [nvidia] kernel thread in the ps output). Enabling persistence mode keeps the driver resident, the clocks drop, and the temperature drops with them. In our case we are down to 48C, and there may be further optimizations to make in the physical deployment, such as staggering the cards in vertically adjacent systems so they are not stacked directly above one another.
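To make this survive a reboot, the cleaner option on current drivers appears to be the persistence daemon rather than the legacy nvidia-smi -pm setting, which does not persist. Assuming your driver package installed the nvidia-persistenced systemd unit (the exact unit name can vary by distro and packaging), something like this should keep the driver resident from boot:

> sudo systemctl enable --now nvidia-persistenced
> nvidia-smi -q | grep Persist

After that, Persistence Mode should report Enabled immediately after a reboot, so the clocks never sit at the high boot-time defaults.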