SW Power Cap always Active

Hi all!

I have a desktop with two GTX 1080 Ti cards. One of them always reports SW Power Cap as Active, even when idling, and it never goes beyond P5 / 50W. I played with nvidia-smi -pl, but it doesn’t seem to have any effect.

Here is nvidia-smi output. The first GPU (id 0) is perfectly fine and works as expected. The second one is dramatically throttled.

Help will be greatly appreciated. Thanks!

==============NVSMI LOG==============

Timestamp                           : Mon Feb 25 12:19:56 2019
Driver Version                      : 410.79
CUDA Version                        : 10.0

Attached GPUs                       : 2
GPU 00000000:03:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-7263c86a-868f-3cdb-ab9f-c2ed0e0f3d84
    Minor Number                    : 0
    VBIOS Version                   : 86.02.40.00.4D
    MultiGPU Board                  : No
    Board ID                        : 0x300
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x03
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:03:00.0
        Sub System Id               : 0x1B0610B0
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 51 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11178 MiB
        Used                        : 10 MiB
        Free                        : 11168 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 72 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 17.12 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 265 MHz
        SM                          : 265 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1936 MHz
        SM                          : 1936 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

GPU 00000000:82:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-bc474f2e-078d-7e79-1a1b-8bb318c3e119
    Minor Number                    : 1
    VBIOS Version                   : 86.02.40.00.4D
    MultiGPU Board                  : No
    Board ID                        : 0x8200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x82
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:82:00.0
        Sub System Id               : 0x1B0610B0
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 25 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11178 MiB
        Used                        : 0 MiB
        Free                        : 11178 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 13.01 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1936 MHz
        SM                          : 1936 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The data from nvidia-smi seems to have been dumped with both GPUs at idle, and I don’t see any evidence of power capping. None of the data looks suspicious in any way, and both GPUs are running the same VBIOS version. I assume you made sure to place exactly the same load on either GPU individually for your tests, or used a multi-GPU capable app that is able to put a full load on both cards simultaneously.

Double-check the power supply cabling: Each GPU should have one 6-pin and one 8-pin PCIe power connector attached. Do not use Y-splitters or converters (Molex-to-PCIe or PCIe 6-pin-to-8-pin) in the cabling for those. Connectors must be fully engaged with the receptacles at the GPU; usually a small tab locks into place with a “click” sound.

Is the power supply (PSU) adequately sized? You would need a PSU rated for at least 1000W, possibly 1200W, depending on what other components are used in the system. Provided the power supply is sufficient, it will be safe to use nvidia-smi to raise the enforced power limit from the default of 250W to the maximum allowed value of 300W, which makes hitting the power limit (and thus capping) less likely.
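As a rough illustration of raising the limit, here is a minimal Python sketch that builds the nvidia-smi command line for one GPU. The function name build_power_limit_cmd is made up for this example, and the 125W / 300W bounds are simply the Min / Max Power Limit values from the log above; other cards will report different bounds. Setting the limit normally requires root privileges.

```python
def build_power_limit_cmd(gpu_index, watts, min_w=125.0, max_w=300.0):
    """Build the argv list for `nvidia-smi -i <gpu> -pl <watts>`.

    min_w / max_w default to the Min / Max Power Limit reported in the
    log above (assumption: your card reports the same range).
    """
    if not (min_w <= watts <= max_w):
        raise ValueError(
            f"{watts} W is outside the allowed range {min_w}-{max_w} W"
        )
    # -i selects the GPU, -pl sets the power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", f"{watts:g}"]


# Hypothetical usage (run as root, e.g. via sudo):
#   import subprocess
#   subprocess.run(build_power_limit_cmd(1, 300), check=True)
```

The range check only mirrors what nvidia-smi itself enforces; it is there so that a typo fails early with a readable message.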

When you physically swap the GPUs between their PCIe slots, does the power cap issue move with the GPU or does it apply to a particular slot?

I would suggest monitoring GPU temperature and fan speed when both GPUs are running with 100% load to make sure you are not hitting a thermal cap rather than a power cap. Even though the GTX 1080 Ti typically uses a blower type fan that exhausts hot air to the outside of the case, one GPU will have a somewhat obstructed air intake (due to the GPU next to it) and I would expect that GPU to run hotter. This problem is much more pronounced with open-fan designs like those found in the RTX line (Turing architecture).
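One way to do this kind of monitoring is to poll nvidia-smi in CSV query mode and parse the result. The sketch below is only an illustration: the helper names (query_command, parse_samples) and the 5-second interval are arbitrary choices for this example, and the field list can be adjusted to whatever `nvidia-smi --help-query-gpu` reports for your driver.

```python
import csv
import io

# Fields to poll; all of these are standard --query-gpu properties.
QUERY_FIELDS = [
    "index",
    "temperature.gpu",
    "fan.speed",
    "power.draw",
    "pstate",
    "utilization.gpu",
]


def query_command(fields=QUERY_FIELDS, interval_s=5):
    """Build an nvidia-smi command that prints one CSV row per GPU every interval_s seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]


def parse_samples(text, fields=QUERY_FIELDS):
    """Turn CSV text from the command above into a list of dicts keyed by field name."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(fields, (value.strip() for value in record))))
    return rows
```

Logging these rows while each GPU is loaded individually, and then both together, should show whether temperature or power draw is the value that stops rising first.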

The data from nvidia-smi seems to have been dumped with both GPUs at idle

It is. That might have been a bit stupid of me, yeah. Here is a sample of nvidia-smi output exhibiting the issue. The same program is running on both GPUs. You’ll see GPU 1 at close to 100% utilization, but stuck in P5 at ~50W.

Mon Feb 25 18:56:54 2019                                                       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
| 54%   84C    P2   252W / 250W |  10579MiB / 11178MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:82:00.0 Off |                  N/A |
| 25%   44C    P5    45W / 250W |  10573MiB / 11178MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12016      C   python3                                    10569MiB |
|    1     21061      C   python3                                    10563MiB |
+-----------------------------------------------------------------------------+

make sure you are not hitting a thermal cap rather than a power cap.

It happens even when the other GPU is idling and cold. HW Thermal Slowdown shows “Not Active” and SW Power Cap is “Active”.

==============NVSMI LOG==============                         
                                                              
Timestamp                           : Mon Feb 25 19:07:04 2019
Driver Version                      : 410.79                  
CUDA Version                        : 10.0                    
                                                              
Attached GPUs                       : 2                       
GPU 00000000:03:00.0                                          
    Performance State               : P2                      
    Clocks Throttle Reasons                                   
        Idle                        : Not Active              
        Applications Clocks Setting : Not Active              
        SW Power Cap                : Not Active              
        HW Slowdown                 : Not Active              
            HW Thermal Slowdown     : Not Active              
            HW Power Brake Slowdown : Not Active              
        Sync Boost                  : Not Active              
        SW Thermal Slowdown         : Active                  
        Display Clock Setting       : Not Active              
                                                              
GPU 00000000:82:00.0                                          
    Performance State               : P8                      
    Clocks Throttle Reasons                                   
        Idle                        : Not Active              
        Applications Clocks Setting : Not Active              
        SW Power Cap                : Active                  
        HW Slowdown                 : Not Active              
            HW Thermal Slowdown     : Not Active              
            HW Power Brake Slowdown : Not Active              
        Sync Boost                  : Not Active              
        SW Thermal Slowdown         : Not Active              
        Display Clock Setting       : Not Active
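
For anyone who wants to pull just the active throttle reasons out of a log like the one above (for example the output of `nvidia-smi -q -d PERFORMANCE`), here is a small Python sketch. It assumes the text layout shown in this thread; the function name active_throttle_reasons is made up for the example, and newer drivers may label the section differently.

```python
import re

# Matches per-GPU header lines such as "GPU 00000000:82:00.0".
GPU_HEADER = re.compile(
    r"^GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def active_throttle_reasons(log_text):
    """Map each GPU bus id to the list of throttle reasons reported as Active."""
    reasons = {}
    gpu = None
    in_section = False
    for line in log_text.splitlines():
        stripped = line.strip()
        header = GPU_HEADER.match(stripped)
        if header:
            gpu = header.group(1)
            reasons[gpu] = []
            in_section = False
        elif stripped == "Clocks Throttle Reasons":
            in_section = True
        elif in_section and gpu is not None:
            key, sep, value = stripped.partition(":")
            if not sep:
                # A blank line or the next section header ends the block.
                in_section = False
            elif value.strip() == "Active":
                reasons[gpu].append(key.strip())
    return reasons
```

Applied to the log above, this would report "SW Thermal Slowdown" for the GPU at 03:00.0 and "SW Power Cap" for the GPU at 82:00.0.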

I’ll check the other leads and get back to you ASAP.

For now, thank you for your answer!

Interesting to see that one GPU shows “SW Thermal Slowdown : Active” while the other shows “SW Power Cap : Active”.

The fact that you don’t see the power-capped GPU drawing more than 45W - 50W may be a big clue. That is roughly how much power an NVIDIA GPU typically draws through the PCIe socket (which is specified to supply up to 75W). So this may be an indication that the PCIe power cables for this GPU aren’t hooked up.
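
The arithmetic behind that suspicion can be written down as a small Python sketch. The connector ratings are the nominal PCIe figures (75W for the slot, 75W for a 6-pin, 150W for an 8-pin); the function names and the 90% utilization threshold are illustrative assumptions, and the check is only a heuristic, not a diagnosis.

```python
PCIE_SLOT_W = 75    # nominal maximum supplied through the x16 slot
SIX_PIN_W = 75      # nominal rating of a 6-pin PCIe power connector
EIGHT_PIN_W = 150   # nominal rating of an 8-pin PCIe power connector


def board_power_budget(six_pin=1, eight_pin=1):
    """Nominal power available to a card with the given auxiliary connectors."""
    return PCIE_SLOT_W + six_pin * SIX_PIN_W + eight_pin * EIGHT_PIN_W


def slot_only_suspected(power_draw_w, utilization_pct, min_util=90):
    """True if a busy GPU draws no more than the slot alone could deliver.

    min_util is an arbitrary threshold chosen for this example.
    """
    return utilization_pct >= min_util and power_draw_w <= PCIE_SLOT_W
```

With one 6-pin and one 8-pin connector the nominal budget is 75 + 75 + 150 = 300W, which matches the card's maximum power limit. GPU 1 in the table above (45W at 97% utilization) trips the check, while GPU 0 (252W at 96%) does not.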

[Later:] There are reports on the internet of NVIDIA GPUs stuck at low clocks. In some cases people were supposedly able to fix the issue simply by installing a new driver and rebooting the machine. It seems worth a try, even though I can’t explain why a driver issue would affect only one of two identical GPUs.

Can you link to such reports? Or let us know which cards they concern? I may currently be having a similar issue with a T4 card (or maybe not).

I am afraid the results of an ad-hoc Google search from February 2019 aren’t available to me at this time. I would suggest Googling for “NVIDIA GPUs stuck at low clocks” and seeing what you can find in the first ten pages of results. That’s pretty much what I did back in 2019. Any actionable information I could find I distilled into the comment above.