Limited clock for the new RTX3090Ti + Ubuntu 20.04

Hi,

We’ve recently bought an RTX 3090 Ti and installed Ubuntu 20.04 on the PC.

The PC configuration is as follows:

  • ASUS PRIME Z690-P D4
  • Intel Core i9 12900
  • AiO LCS Arctic Liquid Freezer II 360
  • DDR4 128GB (4x32GB) 3200MHz Kingston
  • SSD 2TB Samsung 980PRO M.2 NVMe + HS
  • Thermaltake Toughpower GF3 1000W Gold

I’ve connected the GPU with a single 450W PCIe Gen5 cable.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  Off |
| 52%   26C    P8   173W / 450W |    359MiB / 24247MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1092      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1592      G   /usr/lib/xorg/Xorg                 92MiB |
|    0   N/A  N/A      1717      G   /usr/bin/gnome-shell               43MiB |
|    0   N/A  N/A      2690      G   /usr/lib/firefox/firefox          170MiB |
+-----------------------------------------------------------------------------+

Output of nvidia-smi -q is:

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Nov 30 16:14:02 2022
Driver Version                            : 470.141.03
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3090 Ti
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-25dbaeb1-06fd-9cc5-edb0-af1e35453b51
    Minor Number                          : 0
    VBIOS Version                         : 94.02.A0.00.36
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220310DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x88741043
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 6000 KB/s
    Fan Speed                             : 52 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24247 MiB
        Used                              : 350 MiB
        Free                              : 23897 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 15 MiB
        Free                              : 32753 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 3 %
        Memory                            : 14 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 26 C
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 172.68 W
        Power Limit                       : 450.00 W
        Default Power Limit               : 450.00 W
        Enforced Power Limit              : 450.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 516.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 10501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 745.000 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1092
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 35 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1592
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 92 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1717
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 44 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2690
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 160 MiB

What bothers me:

 Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active

The PC basically alternates between SW Power Cap and Idle power cap, which limits the GPU to the P5 or P8 performance state. In P8 the GPU runs at a maximum clock of 210 MHz, which is quite sad.

Any help would be appreciated, because in this state the card is basically useless. Ty, cheers :)

The Idle power cap indicates that the GPU doesn’t think it has any work to do. In that case, doing no work at high clocks is not any better/worse than doing it at low clocks, from a performance perspective, and from a power perspective doing it at low clocks is definitely preferable. I wouldn’t call this “sad”, I would call it “smart”. It’s OK if you disagree, we’re each entitled to our opinions.

The SW Power Cap indicates that the GPU has observed that its own power consumption exceeds either the designed maximum for the GPU or the user-supplied power cap setting (not relevant for GeForce GPUs), and so the GPU intends to limit its power consumption by limiting its clocks.

There isn’t anything you can do about either of these mechanisms. Both of these indications may present themselves rapidly and/or for short periods of time, depending on what work the GPU is doing. Certainly I would expect that many advanced GPU workloads such as DL training could drive the GPU into a SW Power Cap state, and it’s fully expected that when in that state the performance state will be something lower than P0 or P2.

Beyond that, I’m not sure what concerns there may be. There isn’t enough information here to diagnose anything, and I’m not sure there is anything to be diagnosed.

In any event, if you recently purchased the GPU, you can always investigate the vendor’s return policies if you are unsatisfied with it.
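Because these flags can flip quickly, a single snapshot may miss them. Here is a minimal sketch of how one might tally them over several samples. It assumes the `--query-gpu` CSV interface of `nvidia-smi` and the field names listed in `FIELDS` (check `nvidia-smi --help-query-gpu` on your driver); the function names are just illustrative.

```python
import csv
import io
import subprocess
import time
from collections import Counter

# Query fields; assumed to be supported by the installed driver
# (verify with `nvidia-smi --help-query-gpu`).
FIELDS = [
    "pstate",
    "clocks.sm",
    "power.draw",
    "clocks_throttle_reasons.idle",
    "clocks_throttle_reasons.sw_power_cap",
]


def sample_gpu(count=10, interval_s=0.5):
    """Collect `count` CSV samples from nvidia-smi (requires an NVIDIA GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    lines = []
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        lines.extend(out.strip().splitlines())
        time.sleep(interval_s)
    return "\n".join(lines)


def summarize(csv_text):
    """Count how often each (pstate, idle, sw_power_cap) combination occurs."""
    tally = Counter()
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        pstate, _sm_mhz, _watts, idle, sw_cap = (v.strip() for v in row)
        tally[(pstate, idle, sw_cap)] += 1
    return tally
```

Running `summarize(sample_gpu())` while the training job is active should show which combination dominates, rather than whatever a one-off `nvidia-smi -q` happened to catch.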

You’ve misunderstood me. I’m referring to the SW Power Cap that kicks in during DL training. It limits the RTX 3090 Ti clock to 210 MHz. Training runs 10x faster on a GTX 1080 Ti in another PC. On the other hand, I also have an RTX 3090 in another PC with an 850W PSU, and it runs smoothly with the same Ubuntu version and NVIDIA drivers. So something’s not right here, and I wouldn’t jump to conclusions so fast without first going through whatever diagnostic procedures I can use to resolve this issue.

At least we can try before returning the GPU.
Cheers.

NVIDIA doesn’t provide any diagnostic tests for GeForce GPUs that end users can run.

General troubleshooting procedures could include the following:

  • reseat the GPU in the PCIe slot
  • double-check or re-mate any aux-power connections
  • make sure the power supply is sufficient according to the manufacturer’s recommendations
  • try moving the GPU to another system, to see if the trouble follows it or not

That scenario is not reflected in the posted output from nvidia-smi, which shows SW Power Cap : Not Active. Furthermore, 210 MHz does look like a typical frequency one might find in idle state, which is, not surprisingly, what is shown in the output: Idle : Active. So it is unclear as to what is going on, and in particular what the power consumption and operating frequency actually are when SW Power Cap is active. I would expect frequency to be lowered to somewhere in the 1200 MHz to 1500 MHz range when the power cap kicks in.
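To check a saved `nvidia-smi -q` dump for this mechanically, something like the following could be used. It is only a sketch: it assumes the indented layout and the `Clocks Throttle Reasons` section heading exactly as they appear in the output posted above, and `active_throttle_reasons` is a made-up helper name.

```python
# Excerpt copied from the nvidia-smi -q output posted in this thread.
SAMPLE = """\
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
    FB Memory Usage
        Total                             : 24247 MiB
"""


def active_throttle_reasons(q_text):
    """Return the entries of 'Clocks Throttle Reasons' whose value is 'Active'."""
    reasons = []
    in_block = False
    block_indent = 0
    for line in q_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "Clocks Throttle Reasons":
            in_block = True
            block_indent = indent
            continue
        if in_block:
            if not stripped:
                continue
            if indent <= block_indent:
                break  # back at the heading's level: the section has ended
            name, _, value = stripped.partition(":")
            if value.strip() == "Active":
                reasons.append(name.strip())
    return reasons


print(active_throttle_reasons(SAMPLE))
```

On the excerpt above this reports only `Idle`, which is the point: a dump captured while training is running would be needed to see the power cap.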

I note that the GPU is set to the default power limit of 450W. You could raise this to the maximum supported limit of 516W with nvidia-smi -pl to reduce the chances of hitting the power cap. Caveat: Your power supply is likely insufficient for applying this change.
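As a sketch of that change, assuming GPU index 0 and that `sudo` is available (setting the power limit normally requires root), the command could be built and range-checked like this. The bounds are the Min/Max Power Limit values from the posted output, and the helper names are hypothetical.

```python
import subprocess

MIN_LIMIT_W = 100.0  # "Min Power Limit" from the posted nvidia-smi -q output
MAX_LIMIT_W = 516.0  # "Max Power Limit" from the posted nvidia-smi -q output


def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi command that sets the power limit in watts."""
    if not MIN_LIMIT_W <= watts <= MAX_LIMIT_W:
        raise ValueError(
            f"{watts} W is outside the supported range "
            f"{MIN_LIMIT_W:g}-{MAX_LIMIT_W:g} W"
        )
    # -i selects the GPU, -pl sets the power limit.
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-pl", f"{watts:g}"]


def apply_power_limit(watts, gpu_index=0):
    """Run the command; needs an NVIDIA GPU and root privileges."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)
```

`power_limit_command(516)` yields the equivalent of `sudo nvidia-smi -i 0 -pl 516`. Afterwards, the Power Limit line in `nvidia-smi -q` should show whether the change took effect.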

Both high-end CPUs and high-end GPUs can create significant power spikes due to changes in workload and dynamic clocking. For long-term rock-solid operation I therefore recommend that the total nominal power consumption of the system should not exceed 60% of the nominal power rating of the PSU by much. At present, you have:

CPU: 125W
GPU: 450W
DRAM: 70W
motherboard: 40W
SSD: 9W

for a total of about 695W of nominal power consumption, on a 1000W nominal PSU. That is cutting it close. Too close for my comfort. In general, I recommend 80 PLUS Platinum compliant PSUs for a high-end workstation configuration like this.
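The budget above can be checked with a few lines of arithmetic. This is only a sketch of the rule of thumb as stated; the component figures are the nominal values from the list, and the 60% target is the recommendation from this post, not a specification.

```python
# Nominal component power figures in watts, as listed above.
components_w = {
    "CPU": 125,
    "GPU": 450,
    "DRAM": 70,
    "motherboard": 40,
    "SSD": 9,
}
psu_rating_w = 1000      # nominal PSU rating
target_fraction = 0.60   # rule-of-thumb load target from the post above

total_w = sum(components_w.values())
load_fraction = total_w / psu_rating_w
budget_w = target_fraction * psu_rating_w
# Smallest nominal PSU rating that would keep this load at the target.
min_psu_w = total_w / target_fraction

print(f"total nominal draw: {total_w} W")
print(f"PSU load: {load_fraction:.1%} (target {target_fraction:.0%})")
print(f"over the target budget by {total_w - budget_w:.0f} W")
print(f"PSU rating for the target load: {min_psu_w:.0f} W")
```

The exact sum is 694 W, which is the "about 695W" quoted above, or roughly 69% of the PSU rating.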

Hi,

I’ve increased the limit to 516W and then started GPU DL training.

This is output from the nvidia-smi -q:

nvidia-smi -q 

==============NVSMI LOG==============

Timestamp                                 : Thu Dec  1 11:14:29 2022
Driver Version                            : 470.141.03
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3090 Ti
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-25dbaeb1-06fd-9cc5-edb0-af1e35453b51
    Minor Number                          : 0
    VBIOS Version                         : 94.02.A0.00.36
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220310DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x88741043
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 863000 KB/s
        Rx Throughput                     : 843000 KB/s
    Fan Speed                             : 52 %
    Performance State                     : P2
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24247 MiB
        Used                              : 23804 MiB
        Free                              : 443 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 17 MiB
        Free                              : 32751 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 100 %
        Memory                            : 36 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 44 C
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 219.06 W
        Power Limit                       : 516.00 W
        Default Power Limit               : 450.00 W
        Enforced Power Limit              : 516.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 516.00 W
    Clocks
        Graphics                          : 225 MHz
        SM                                : 225 MHz
        Memory                            : 10251 MHz
        Video                             : 1530 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 10501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 920.000 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1145
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 35 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1866
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 110 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1994
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 73 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2307
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 163 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 12321
            Type                          : C
            Name                          : python3
            Used GPU Memory               : 23401 MiB

And nvtop output is following:

I’m using a 12VHPWR connector to connect the GPU to the PSU.

Training starts fine: even after an Ubuntu reinstall, it loads the data into memory and then starts the training procedure. After one or two batches, the clock drops to 225 MHz or 210 MHz and stays there, with SW Power Cap: Active the whole time.
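One way to pin down when the drop happens is to log samples once per second while training starts, for example with `nvidia-smi --query-gpu=timestamp,clocks.sm,power.draw --format=csv,noheader,nounits -l 1`, and then look for the first low-clock sample. The sketch below assumes that three-column log layout; the function names, the 300 MHz floor, and the sample rows are illustrative assumptions, not measurements from this system.

```python
import csv
import io


def parse_log(text):
    """Parse 'timestamp, clocks.sm, power.draw' CSV lines (noheader, nounits)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [(row[0], int(row[1]), float(row[2])) for row in reader if row]


def find_clock_drop(rows, floor_mhz=300):
    """Return the first sample at or below floor_mhz that follows at least
    one sample above it, or None if no such drop is seen."""
    seen_high = False
    for timestamp, sm_mhz, power_w in rows:
        if sm_mhz > floor_mhz:
            seen_high = True
        elif seen_high:
            return timestamp, sm_mhz, power_w
    return None


# Illustrative samples only, not data recorded in this thread.
sample = """\
2022/12/01 11:14:20.000, 1905, 310.50
2022/12/01 11:14:21.000, 1890, 325.10
2022/12/01 11:14:22.000, 225, 219.06
2022/12/01 11:14:23.000, 225, 218.40
"""
print(find_clock_drop(parse_log(sample)))
```

Matching the timestamp of the drop against the training log and the system log would show whether it lines up with a particular batch or driver message.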

I know that there are possible power spikes, however I firmly believe that this is due to some software related bug or unexpected behavior. GPU starts working fine and then something throttles it.

I agree that maybe 1000W 85+ Plus would be comfortable; however, most of the online resources recommended a 1kW ATX3 PSU for this kind of configuration. Therefore I believe it should work, or in the worst case it should limit the clock to some reasonable, usable value.

Thank you for your time and support.

Just to compare, running same script on same SW configuration on PC with RTX 3090 and PSU 850W gives me following:

==============NVSMI LOG==============
Timestamp                                 : Thu Dec  1 11:38:31 2022
Driver Version                            : 470.141.03
CUDA Version                              : 11.4
Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-1313ab19-39d2-f66e-455b-eab562e85975
    Minor Number                          : 0
    VBIOS Version                         : 94.02.42.00.B4
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x87B31043
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 29000 KB/s
        Rx Throughput                     : 2227000 KB/s
    Fan Speed                             : 89 %
    Performance State                     : P2
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24259 MiB
        Used                              : 23687 MiB
        Free                              : 572 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 18 MiB
        Free                              : 238 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 100 %
        Memory                            : 78 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 73 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 337.61 W
        Power Limit                       : 350.00 W
        Default Power Limit               : 350.00 W
        Enforced Power Limit              : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 375.00 W
    Clocks
        Graphics                          : 1845 MHz
        SM                                : 1845 MHz
        Memory                            : 9501 MHz
        Video                             : 1605 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 1025.000 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1214
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 35 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1932
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 1621 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2059
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 85 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2367
            Type                          : G
            Name                          : /opt/teamviewer/tv_bin/TeamViewer
            Used GPU Memory               : 17 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 499006
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 13 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 620689
            Type                          : G
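            Name                          : /usr/share/code/code --type=gpu-process --field-trial-handle=1937039732783260010,5543013318246478987,131072 --disable-features=CookiesWithoutSameSiteMustBeSecure,SameSiteByDefaultCookies,SpareRendererForSitePerProcess --disable-color-correct-rendering --enable-crash-reporter=fff748e8-348f-4b36-8dac-27080d10043a,no_channel --global-crash-keys=fff748e8-348f-4b36-8dac-27080d10043a,no_channel,_companyName=Microsoft,_productName=VSCode,_version=1.65.1 --user-data-dir=/home/zdenka/.config/Code --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files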
            Used GPU Memory               : 11 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 686745
            Type                          : C
            Name                          : python3
            Used GPU Memory               : 21881 MiB

So, on the other PC SW Power Cap is also Active; however, the clock is not limited to such a small value. :)

It is something similar to this problem. However, I’m using an air-cooled ASUS RTX 3090Ti, so I wouldn’t like to do anything that could void the warranty.

This combination of power consumption and frequency setting looks extremely weird to me, to say the least. I have never encountered anything like this. No idea what could be going on. Obviously the GPU is not even close to the power limit, so power capping should not be active. Are you seeing any error messages from the NVIDIA driver in system logs?
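The mismatch described here can be checked mechanically from saved `nvidia-smi -q` output. The following is a rough sketch: it reads three fields by their labels and flags the case where SW Power Cap is Active although the draw is well below the enforced limit. The helper names and the 0.8 margin are assumptions chosen for illustration, and the sketch does not handle readings reported as N/A.

```python
import re


def read_field(report, name):
    """Return the value after 'name :' on the first matching line, or None."""
    pattern = rf"^[ \t]*{re.escape(name)}[ \t]*:[ \t]*(.+?)[ \t]*$"
    match = re.search(pattern, report, re.MULTILINE)
    return match.group(1) if match else None


def watts(text):
    """Convert a reading such as '219.06 W' to a float."""
    return float(text.split()[0])


def suspicious_power_cap(report, margin=0.8):
    """True when SW Power Cap is Active although the draw is below
    margin * enforced limit. The 0.8 margin is an arbitrary assumption."""
    cap_active = read_field(report, "SW Power Cap") == "Active"
    draw = watts(read_field(report, "Power Draw"))
    limit = watts(read_field(report, "Enforced Power Limit"))
    return cap_active and draw < margin * limit


# Excerpts of the two reports posted in this thread.
rtx_3090_ti = """
        SW Power Cap                      : Active
        Power Draw                        : 219.06 W
        Enforced Power Limit              : 516.00 W
"""
rtx_3090 = """
        SW Power Cap                      : Active
        Power Draw                        : 337.61 W
        Enforced Power Limit              : 350.00 W
"""
print(suspicious_power_cap(rtx_3090_ti))
print(suspicious_power_cap(rtx_3090))
```

With the posted values, the RTX 3090 Ti report is flagged and the RTX 3090 report is not, since the latter really is running close to its 350 W limit.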

I noticed another thing that looks odd. It probably has nothing to do with the issue at hand, but it could indicate that the RTX 3090 Ti is installed in the wrong PCIe slot. Your board has one PCIe 4 x16 and two PCIe 4 x8 slots, correct? nvidia-smi indicates the GPU is operating on a x8 link:

        GPU Link Info
            Link Width
                Max                       : 16x
                Current                   : 8x
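The same comparison can be scripted from the query interface, e.g. `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader`. A small sketch, assuming one CSV line in that field order (the function name is hypothetical):

```python
def check_pcie_link(csv_line):
    """Compare current and maximum PCIe link settings.

    csv_line holds 'gen.current, gen.max, width.current, width.max'.
    Returns a list of findings, empty if the link runs at its maximum.
    """
    gen_cur, gen_max, width_cur, width_max = (int(v) for v in csv_line.split(","))
    findings = []
    if width_cur < width_max:
        findings.append(f"link width x{width_cur} is below the maximum x{width_max}")
    if gen_cur < gen_max:
        findings.append(f"link generation {gen_cur} is below the maximum {gen_max}")
    return findings


# Values from the RTX 3090 Ti report in this thread: gen 4 of 4, width 8 of 16.
print(check_pcie_link("4, 4, 8, 16"))
```

A lower link generation at idle can be normal power saving, so the generation check is best read while the GPU is under load; a reduced width is the more telling sign of slot choice or seating.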

I am not sure whether we can learn anything from comparing the RTX 3090 behavior to the RTX 3090 Ti behavior from the data shown. These are different GPUs with different power consumption levels (100W+ difference) and therefore (presumably) different power-management characteristics. You might possibly learn something from a cross-check, by swapping the GPUs between the two systems. Does the issue follow the GPU or the system?

My warning regarding the power draw of a RTX 3090 Ti set to a limit of 516W is generic. I have seen multiple credible reports (i.e. with sensor output graph) of RTX 3090 Ti configured for the default 450W limit spiking up to the 530W to 550W range. In addition, the Intel CPU you have in the system has an Intel-specified maximum short-term boost power draw of > 200W. We have had in this forum multiple reports of spontaneous system reboots from PSUs being overwhelmed with setups similar to yours specifically when running deep learning applications, but not other workloads. I notice that all your system components appear to be premium parts. You may want to extend that build philosophy to the PSU.

That said, your current problem is clearly not operating the system at the power limit, but GPU power capping being active despite the GPU drawing far less than the power limit. Other than trying the latest NVIDIA drivers, I have no further ideas.

My (limited) understanding of GPU power management is that it uses a combination of VBIOS firmware and the GPU driver software. It is possible that there is a bug affecting one of those or with a particular firmware/driver combination. There is also a possibility of some defect in the RTX 3090 Ti, e.g. defective sensor or controller chip (I²C? Not sure what people use these days). From that perspective, two possible courses of action could be (1) filing a bug report with NVIDIA, or (2) RMAing the card. If you have a possibility of getting a different RTX 3090 Ti (e.g. as a loaner) to try in this system, that might help figure out which course of action is the most appropriate.

I notice you are running a graphical desktop on this GPU. Can you go into the nvidia-settings control panel and check the powermizer settings:

What is your preferred mode set to?

Another thing that is odd about the posted data for the RTX 3090 Ti is the voltage:

GPU power management regulates the voltage along with the frequency, with higher frequencies requiring higher voltage. I do not know the available voltage range for the RTX 3090 Ti specifically, but for recent NVIDIA GPUs it is roughly in the 0.7V to 1.1V range. At the low 215 MHz idle frequency setting we would thus expect to see around 0.7V and at the highest possible boost clock of 1987 MHz we would expect to see close to 1.1V. The output of nvidia-smi shows 0.92V, which probably corresponds to about 1650 MHz to 1700 MHz.

The power draw and voltage reported plus the projected frequency make me think that there could be an issue with nvidia-smi retrieving the frequency setting of the GPU. I do not know what protocol is being used to transfer the data or how exactly it is generated (firmware reading out PLL settings?), but it seems conceivable that a raw reading of “zero” might result in the lowest possible frequency being reported by nvidia-smi. That would mean the GPU is not actually running at 215 MHz here; it just looks that way due to missing or faulty sensor data. I emphasize that this is pure conjecture at this point.

Thank you for your comments and support.

Regarding powermizer settings, they are set to: Prefer Maximum Performance.

I’ve tried following:

  • Ubuntu 20.04 + 470 → doesn’t work
  • Ubuntu 20.04 + 515 → doesn’t work
  • Ubuntu 20.04 + 520 → doesn’t work
  • Ubuntu 22.04 + 515 → doesn’t work
  • Ubuntu 22.04 + 520 → doesn’t work

Then I tried Win 10, started MSI Afterburner and ShaderToy, and things were working. The GPU was at 70% TDP, around 1900 MHz and 65 °C, for 10 minutes or so. The GPU driver was the newest, 526. However, I installed Win 11 after that and the same 215 MHz limit occurred. I returned to Win 10 and the 215 MHz limit occurred again.

I’ve tried GPU on both PCI slots shown in the following picture, however, same thing occurs.

I don’t know how the M2 SATA SSD comes into play regarding PCIe slots, because I have one too. However, at this point I think I’m going to opt for RMA because I can’t make sense of this (especially the sequence: not working, Win 10 working, not working, Win 10 not working).

I really don’t know what else I can do, nor am I finding similar problems reported by others.

Thank you for your help :)

The fact that it was apparently working as desired on Win10 one moment, but not a short while later is indeed strange and does make it look like there could be some sort of intermittent issue with the hardware. You made sure the software was identical both times (working / not working), correct? Controlled experiments where only one variable changes at any one time are super important when trying to diagnose things like this.

As I said, the PCIe slot issue is orthogonal to your clocking issue. It seems to me the GPU should be (for general performance reasons) in the silver-lined slot, as it is labelled “PCIe x16 (G5)”. So this seems to be a PCIe 3/4/5 x16 slot. The other slot seems to be just PCIe 3 x16 based on its labelling, while the RTX 3090 Ti supports PCIe 4.

If you manage to resolve this issue in the near future, it would be nice to read here what the solution turned out to be, so we can all learn for the future.

As for the SSD, based on the motherboard’s specs on the ASUS website, there appear to be three dedicated slots for SSDs on this board (M.2_1, M.2_2, M.2_3), all of which seem to support PCIe 4.0 x4 mode, so I am not sure how one would pick one over the other.

I’ll move the GPU to the silver-lined PCIe x16 (G5) slot; however, it was seated there when the clock limiting started occurring, and I don’t think any PCI slot would affect the clock so greatly.

At this point the control variable was OS + drivers (clean OS install, then one driver, then uninstall, reboot, then another, etc.). The fact that there was a situation where the GPU worked with the reported and expected performance makes me doubt that the clock limiting is a hardware issue.

However, the GPU worked with a clean Win 10 install without any updates or ASUS drivers, using a slightly older edition (I don’t know which exactly). Win 11 with updates, Win 10 with updates, Win 10 with ASUS drivers, and Win 11 with ASUS drivers all limit the clock.

I’m suspecting MB drivers, or some other power capping feature.

At this point I’m basically wondering if it is a hardware issue (could be, especially given the post where someone reports a limited clock due to overheating; however, this GPU doesn’t overheat) or some software bug which somehow triggers clock limiting due to a specific combination of HW parts.

All in all, thank you all for the help. I’ll check the connectors once more, swap the PCI slot and run the test one more time. If it fails, I’ll probably go for RMA because I don’t have a spare similar GPU with which I can reliably test the power capping. I can try with the GTX 1080Ti, but it has considerably lower power consumption…

I’ll let you know end result for sure :)

Looking at the specs and manual for the motherboard, only the PCIe x16 (G5) slot supports the full x16 lanes. The others have only x4 lanes, and this can be seen in the photo above, where the G3 slot only has contacts visible in about a third of the slot.

On this board, the M2 slots seem to be on their own dedicated lanes, unlike some where lanes are taken from PCIe slots when an SSD is used. Ideally you want it in the M.2_1 slot, which puts it directly on the CPU PCIe lanes.

Edit: To remove incorrect PCIe lane utilisation info.

Hi guys,

I’ve figured it out.

It was due to “bad” 12VHPWR connector on the GPU.

When I apply slight pressure by hand in the direction the arrow shows in the picture below, you can hear a click and the GPU clock returns to normal. Without that pressure, the GPU clock remains at 220 MHz.

I’ll opt for RMA because that’s not expected behavior.

Thank you everyone for the help :)

Thanks for unraveling the mystery!

So it was an intermittent hardware issue after all. It makes perfect sense that the GPU clocks down to minimum speed if it can only get power through the PCIe slot (which is officially rated for a maximum of 75W, but most NVIDIA GPUs limit themselves to around 40W).
