Nvidia L4 Cards in Dell R450 Servers Run Hot Upon Installation

We have previously installed dual T4 cards in an R450 server. The cards ran nicely and idled at around 30C.

I installed single L4 cards in a few more R450 servers, but before I had configured any drivers or run anything on them, I noticed the cards were too hot to handle comfortably. (I had to go back in to remove the plastic tape from one.) nvidia-smi reports these cards at around 70C at idle. Our data scientist finds the L4 benchmarking worse than the T4, and we suspect the card is throttling due to thermal management.
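For anyone reproducing this, a quick way to watch temperature, power draw, and performance state side by side (this assumes nothing beyond the nvidia-smi tool that ships with the driver) is a looping query, for example:

> nvidia-smi --query-gpu=temperature.gpu,power.draw,pstate --format=csv -l 5

The -l 5 flag repeats the readout every five seconds; Ctrl-C stops it.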

The pattern we see, compared with AWS instances or with our own T4s, is that the clocks run high at idle (2000+ MHz versus a few hundred), as if the cards are locked into some kind of non-scaling mode.

L4 clocks at idle:

    Clocks
        Graphics                          : 2040 MHz
        SM                                : 2040 MHz
        Memory                            : 6250 MHz
        Video                             : 1770 MHz
    Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz

T4 clocks at idle:

    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Default Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
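For reference, the clock sections above come from nvidia-smi's device query; the same dump should be reproducible on either card with:

> nvidia-smi -q -d CLOCK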

We see performance state P8 on our T4, but the L4s alternate between P8 and P0, and the time spent in P0 seems to coincide with the card sitting above 70C.
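One way to check whether the driver is actually reporting thermal slowdown, rather than inferring it from the temperature alone, is the performance section of the device query. The section heading and field names vary a little between driver versions, but there are SW/HW thermal slowdown entries that flip to Active when the card throttles:

> nvidia-smi -q -d PERFORMANCE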

I generated a nvidia-bug-report.log.gz but do not see how to post it. Is it possible to obtain technical support from Nvidia?
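(For anyone following along, the archive comes from the bug-report script that ships with the driver, run as root; it writes nvidia-bug-report.log.gz into the current directory.)

> sudo nvidia-bug-report.sh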

The answer here may be to enable persistence mode. It seems that idle clock scaling does not kick in unless the Nvidia driver stays loaded:

> ps auxww | grep nvidia
root        1423  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        1424  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
> nvidia-smi -q | grep Persist
    Persistence Mode                      : Disabled
> sudo nvidia-smi -pm 1
Enabled Legacy persistence mode for GPU 00000000:17:00.0.
All done.
> ps auxww | grep nvidia
root        1423  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        1424  0.0  0.0      0     0 ?        S    14:23   0:00 [nvidia-modeset/]
root        6271  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/204-nvidia]
root        6272  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/205-nvidia]
root        6273  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/206-nvidia]
root        6274  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/207-nvidia]
root        6275  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/208-nvidia]
root        6276  0.0  0.0      0     0 ?        S    14:26   0:00 [irq/209-nvidia]
root        6277  0.0  0.0      0     0 ?        S    14:26   0:00 [nvidia]
> nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk 
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz 
    0     32     68      -      0      0      0      0      0      0   6250   2040 
    0     32     67      -      0      0      0      0      0      0    405    285 
    0     16     67      -      0      0      0      0      0      0    405    210 
    0     16     66      -      0      0      0      0      0      0    405    210 

At first the driver is not loaded (there is no [nvidia] kernel thread in the ps output). Enabling persistence mode keeps the driver resident, the clocks drop, and the temperature drops with them. In our case we are down to 48C, and there may be further optimizations to make in the physical deployment, such as staggering the cards in vertically adjacent systems so they are not stacked directly above one another.
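To make this survive a reboot, the cleaner option on current drivers appears to be the persistence daemon rather than the legacy nvidia-smi -pm setting, which does not persist. Assuming your driver package installed the nvidia-persistenced systemd unit (the exact unit name can vary by distro and packaging), something like this should keep the driver resident from boot:

> sudo systemctl enable --now nvidia-persistenced
> nvidia-smi -q | grep Persist

After that, Persistence Mode should report Enabled immediately after a reboot, so the clocks never sit at the high boot-time defaults.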