Tesla P4 is stuck in P0

Hello,

I have just installed a Tesla P4 compute card in my Manjaro (Arch Linux based) system in order to use it with the Willow Inference Server project among other things.

The expected behavior is that the Docker image starts up, loads the models into VRAM, warms them up and then sits idle waiting for commands. The GPU should therefore go through these performance states:

P8 → P0 → P2

Sadly, in my case, the card gets “stuck” in the P0 state, drawing around 30W, and in less than 10 minutes it reaches 93C, at which point it simply stops because of thermal overload.

Do you have any idea what could cause the card to get stuck in P0?
nvidia-smi -q gives me this when WIS is idling:

Driver Version                            : 530.41.03
CUDA Version                              : 12.1

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : Tesla P4
    Product Brand                         : Tesla
    Product Architecture                  : Pascal
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0420218024108
    GPU UUID                              : GPU-701701f0-6d5b-67c5-3371-afbf742d22b0
    Minor Number                          : 0
    VBIOS Version                         : 86.04.55.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : 900-2G414-0000-000
    GPU Part Number                       : 1BB3-895-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G414.0200.00.03
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1BB310DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x11D810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 3
                Host Max                  : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 7680 MiB
        Reserved                          : 73 MiB
        Used                              : 3956 MiB
        Free                              : 3650 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 61 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.53 W
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 60.00 W
        Max Power Limit                   : 75.00 W
    Clocks
        Graphics                          : 1113 MHz
        SM                                : 1113 MHz
        Memory                            : 2999 MHz
        Video                             : 999 MHz
    Applications Clocks
        Graphics                          : 885 MHz
        Memory                            : 3003 MHz
    Default Applications Clocks
        Graphics                          : 885 MHz
        Memory                            : 3003 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1531 MHz
        SM                                : 1531 MHz
        Memory                            : 3003 MHz
        Video                             : 1379 MHz
    Max Customer Boost Clocks
        Graphics                          : 1113 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 175366
            Type                          : C
            Name                          : gunicorn: worker [main:app]
            Used GPU Memory               : 3954 MiB
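
For anyone comparing cards, a dump like the one above can be reduced to the few fields that matter for this issue. The following is only a minimal sketch in Python: the function names `summarize_smi_query` and `looks_stuck_in_p0` are made up for illustration, and it assumes the plain-text layout of `nvidia-smi -q` shown above (`Key : Value` lines).

```python
def summarize_smi_query(text):
    """Pick a few idle-related fields out of `nvidia-smi -q` text output.

    Assumes the "Key : Value" layout shown in the dump above.
    """
    wanted = (
        "Persistence Mode",
        "Performance State",
        "Gpu",               # GPU utilization, listed under "Utilization"
        "GPU Current Temp",
        "Power Draw",
    )
    summary = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI bus IDs survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in summary:
            summary[key] = value.strip()
    return summary


def looks_stuck_in_p0(summary):
    """Hypothetical heuristic: P0 reported while GPU utilization is 0 %."""
    return (
        summary.get("Performance State") == "P0"
        and summary.get("Gpu") == "0 %"
    )
```

Feeding it the output of `nvidia-smi -q` captured while the container is idle should give a dictionary such as `{"Performance State": "P0", "Power Draw": "27.53 W", ...}` for the dump above.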

I understand that it might well be WIS that is at fault here, but to rule that out, can you suggest a docker image that I could use to test the same kind of behavior? Something that starts up, loads a model in VRAM and then sits idle not computing anything on the GPU.
I tried the nbody sample Docker image, but it exits once its computation is done, so the GPU goes back to P8 and the test is inconclusive.
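
One way to reproduce "load something into VRAM, then sit idle" without WIS might be a tiny script run inside any CUDA-enabled container. This is a rough sketch, not a known-good test case: it assumes PyTorch with CUDA support is installed, and the names `elements_for_mib` and `hold_vram` are invented here.

```python
import time


def elements_for_mib(mib, bytes_per_element=4):
    """Number of elements needed to fill `mib` MiB (float32 by default)."""
    return mib * 1024 * 1024 // bytes_per_element


def hold_vram(mib=1024, idle_seconds=600):
    """Allocate `mib` MiB on the GPU, then idle while keeping the context open.

    Assumption: PyTorch with CUDA support is available in the environment.
    """
    import torch  # imported lazily so the helper above works without PyTorch

    buf = torch.empty(elements_for_mib(mib), dtype=torch.float32, device="cuda")
    torch.cuda.synchronize()  # make sure the allocation has really happened
    time.sleep(idle_seconds)  # no GPU work here; watch nvidia-smi meanwhile
    return buf.numel()
```

Calling `hold_vram()` and watching `nvidia-smi` from another terminal should show whether the card leaves P0 while the process holds memory but does no compute.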

Any suggestion is most welcome as I’m a bit lost as to what I’m missing here.

Regards

Just for reference, I have used the frigate docker image and it shows the same behavior, despite their documentation showing it should go to P2 state.

I came here with a similar issue. I put a P4 on my robot and it stays in P0 at 23W idle whenever there’s a process running on it (currently faster-whisper and blenderbotv1), regardless of activity. However, when I swapped in an RTX 2060 to test it, with no change of settings or drivers, the 2060 dropped down to P8 at 6W idle after a few moments of no activity. I was a bit surprised that a 170W card would actually consume a lot less at idle than a 70W card. I’m using the 535.104.12 driver.

Please make sure nvidia-persistenced is starting on boot.

Thanks for the reply. Yes, it’s started. All I do is shut down, swap cards, and reboot. With processes running on the cards, the 2060 throttles down when there is no active request, but the P4 does not. The P4 throttles down when there are no processes running on it, but not while something is loaded on it.

I took advantage of a Black Friday special and got a GeForce RTX 3060 installed instead of the P4 and I’m observing the same result: the 3060 throttles back to P2 then P8 while the Tesla P4 stays at P0 and locks down for thermal overload.

It seems there definitely is an issue with the Tesla P4 but I’m in no position to find out what it is.

That’s a shame. The P4 really is working well on my robot (small, 75W), except that it is responsible for a considerable amount of current draw and fan noise when it’s just sitting there idling, waiting for someone to ask it a question. I’m curious whether there’s an older version of the driver that might actually work better.

The P40 also has the same problem.

After finding this:

I’ve come to the conclusion that Tesla cards will not enter P8 state when idling. I can confirm at least the 1050 Ti card can.

I had the same issues on Linux when running P40s. Couldn’t find any solution.

The only workaround I found is that this doesn’t occur on Windows. There, the GPU powers down to about 9-10W when idle, even if a process is loaded.

I just run the P40s on a Windows Server VM now which is a bit of a pain but saves $$

If you install the drivers on Windows, you will then need to follow this guide and use regedit to make the GPU visible:

I have the same issue here on the latest Pop OS Server, running the latest drivers (version 565) with an A5000. Idle power consumption is 90 Watts at 0% utilization. Only shortly after boot is the power consumption lower than that. Swapping the A5000 for a 4070 Super shows that idle mode works there, so it isn’t caused by background tasks or the like.

Did anyone solve this issue?

Update:
Running sudo nvidia-settings -a [gpu:0]/GpuPowerMizerMode=3 reduces the idle power consumption to about 54 Watts.

Other modes like 0, 1 and 2 all make the idle power ramp up again to 90-91 Watts.
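
To compare the PowerMizer modes more systematically, something along these lines could be used. It is only a sketch: the helper names are invented, the `nvidia-settings` attribute is the one quoted above, and it assumes `nvidia-smi` supports `--query-gpu=power.draw,pstate` on the installed driver. The command above was run with sudo, which may or may not be needed on other setups.

```python
import subprocess


def powermizer_command(mode, gpu=0):
    """Build the nvidia-settings call quoted above as an argument list.

    Passing a list (no shell) avoids the shell treating [gpu:0] as a glob.
    """
    return ["nvidia-settings", "-a", f"[gpu:{gpu}]/GpuPowerMizerMode={mode}"]


def parse_power_sample(csv_line):
    """Parse one 'power, pstate' CSV line, e.g. '27.53, P0'."""
    power, pstate = [field.strip() for field in csv_line.split(",")]
    return float(power), pstate


def read_power_sample(gpu=0):
    """Query current power draw (W) and performance state via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu}",
         "--query-gpu=power.draw,pstate",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_power_sample(out.splitlines()[0])
```

A possible loop would run `subprocess.run(powermizer_command(mode), check=True)` for each mode, wait a minute or so for the card to settle, and then record `read_power_sample()`.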