RTX 2080 Ti: “SW Power Cap” always active and low utilization

Hi all,
Looking for help with my setup,

I have a brand new SuperMicro server with 4 RTX 2080 Ti cards.
Ubuntu 18.04 + CUDA 10.0 + NVIDIA driver 410.79 (also tried with driver version 415)

The issue is that GPU utilization is only about 60%-70% and the performance state is P2.

As far as I can see, “SW Power Cap : Active” is always set on all GPUs.
I also tried a 2080 Ti from another vendor and got the same “SW Power Cap” problem; changing the power limit does not help either.

The server hardware is high-end, including 2 PSUs of 2200 W each.

some nvidia-smi outputs:

==============NVSMI LOG==============

Timestamp                           : Thu Feb 28 15:11:57 2019
Driver Version                      : 410.79
CUDA Version                        : 10.0

Attached GPUs                       : 1
GPU 00000000:86:00.0
    Product Name                    : GeForce RTX 2080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-aa9365b0-01d8-884f-08fc-62515a8a21a8
    Minor Number                    : 0
    VBIOS Version                   : 90.02.17.00.C9
    MultiGPU Board                  : No
    Board ID                        : 0x8600
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x86
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0410DE
        Bus Id                      : 00000000:86:00.0
        Sub System Id               : 0x12AE10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 14000 KB/s
        Rx Throughput               : 45000 KB/s
    Fan Speed                       : 61 %
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 10989 MiB
        Used                        : 10868 MiB
        Free                        : 121 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 8 MiB
        Free                        : 248 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 69 %
        Memory                      : 53 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 83 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 227.85 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 280.00 W
    Clocks
        Graphics                    : 1770 MHz
        SM                          : 1770 MHz
        Memory                      : 6800 MHz
        Video                       : 1635 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7000 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 4540
            Type                    : C
            Name                    : /usr/bin/python3
            Used GPU Memory         : 10857 MiB
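
For readers who want to pull the relevant fields out of a dump like the one above, here is a minimal Python sketch. It works on a pasted excerpt of the `nvidia-smi -q` text (values copied from this post) rather than calling the tool, and the helper names (`parse_fields`, `summarize`) are made up for illustration. It only reports which throttle reasons are flagged and how far the card is from its power limit and maximum operating temperature; it does not diagnose the cause.

```python
# Excerpt of the `nvidia-smi -q` output posted above (values copied verbatim).
SAMPLE = """\
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    Temperature
        GPU Current Temp            : 83 C
        GPU Max Operating Temp      : 89 C
    Power Readings
        Power Draw                  : 227.85 W
        Power Limit                 : 250.00 W
"""


def parse_fields(text):
    """Return {key: value} for every 'key : value' line.

    Section headers (lines without ' : ') are skipped. Keys that repeat in a
    full dump (e.g. 'Current') would overwrite each other; this excerpt has none.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def leading_number(value):
    """'227.85 W' -> 227.85"""
    return float(value.split()[0])


def summarize(text):
    """Collect active throttle reasons plus power and temperature margins."""
    f = parse_fields(text)
    return {
        "active_reasons": [k for k, v in f.items() if v == "Active"],
        "power_headroom_w": round(
            leading_number(f["Power Limit"]) - leading_number(f["Power Draw"]), 2
        ),
        "temp_margin_c": leading_number(f["GPU Max Operating Temp"])
        - leading_number(f["GPU Current Temp"]),
    }


print(summarize(SAMPLE))
```

On the excerpt above this reports “SW Power Cap” as the only active reason, with the card drawing about 22 W less than its 250 W limit and sitting 6 C below its maximum operating temperature.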

Any ideas why the power cap is active and reducing performance?
I’d appreciate any help.

Regards,
Ilya

nvidia-bug-report.log.gz (1.14 MB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Didn’t notice you attached the log.
I don’t know why, or whether it has any influence on your issue, but nvidia-persistenced is continuously starting and stopping:

Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: Verbose syslog connection opened
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: Started (1915)
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1913]: Received signal 15
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1913]: Shutdown (1915)
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: Received signal 15
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: PID file unlocked.
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: PID file closed.
Feb 28 14:43:31 rbc-gpu nvidia-persistenced[1915]: Shutdown (1915)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Verbose syslog connection opened
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Started (2284)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: device 0000:86:00.0 - registered
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: device 0000:86:00.0 - persistence mode enabled.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: device 0000:86:00.0 - NUMA memory onlined.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Local RPC services initialized
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Received signal 15
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Socket closed.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: device 0000:86:00.0 - persistence mode disabled.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: device 0000:86:00.0 - NUMA memory offlined.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: PID file unlocked.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: PID file closed.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Shutdown (2284)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Verbose syslog connection opened
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Started (2299)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: device 0000:86:00.0 - registered
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: device 0000:86:00.0 - persistence mode enabled.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: device 0000:86:00.0 - NUMA memory onlined.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Local RPC services initialized
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Received signal 15
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Socket closed.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: device 0000:86:00.0 - persistence mode disabled.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: device 0000:86:00.0 - NUMA memory offlined.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: PID file unlocked.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: PID file closed.
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Shutdown (2299)

Can you check if disabling it changes anything?
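
To check whether the daemon is restarting on another machine, a log excerpt like the one above can be summarised with a small script. This is only a sketch: the regular expression assumes the syslog line layout shown in this thread, the year is an assumption (syslog lines omit it; 2019 is taken from the nvidia-smi timestamp), and the 5-second threshold for "short-lived" is an arbitrary illustrative value.

```python
import re
from datetime import datetime

# A few lines copied from the log excerpt above.
LOG = """\
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Started (2284)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Received signal 15
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2284]: Shutdown (2284)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Started (2299)
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Received signal 15
Feb 28 14:43:48 rbc-gpu nvidia-persistenced[2299]: Shutdown (2299)
"""

LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ "
    r"nvidia-persistenced\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def daemon_lifetimes(log_text, year=2019):
    """Map each daemon PID to the seconds between its 'Started' and 'Shutdown' lines.

    `year` is an assumption, because syslog timestamps do not carry one.
    """
    started = {}
    lifetimes = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        pid = int(m.group("pid"))
        msg = m.group("msg")
        if msg.startswith("Started"):
            started[pid] = ts
        elif msg.startswith("Shutdown") and pid in started:
            lifetimes[pid] = (ts - started.pop(pid)).total_seconds()
    return lifetimes


def short_lived(lifetimes, threshold_s=5.0):
    """PIDs whose daemon instance lived less than threshold_s seconds (arbitrary cutoff)."""
    return sorted(pid for pid, secs in lifetimes.items() if secs < threshold_s)


lifetimes = daemon_lifetimes(LOG)
print(lifetimes, short_lived(lifetimes))
```

In the excerpt, both instances (PIDs 2284 and 2299) start and shut down within the same second, which is the start/stop pattern described above.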

Same result: utilization is still low.

Is there any update on this? I’m seeing the same problem.

Please make sure nvidia-persistenced is continuously running.

Hello,

I am running into nearly the same problem on an Alienware m15 laptop with an RTX 2080 Max-Q under Ubuntu 18.04 and 18.10. I’ve tried many different drivers, including the referenced 418.43 (officially supported). No matter what CUDA process I run, the GPU power draw will not exceed the low 40 W range. Simple models that run fine on a GTX 1080 Ti (in fact I can run 6-8 at the same time) completely max out the 2080 GPU.

I’ve tried everything, including disabling all CPU throttling and forcing the maximum GPU power profile, and nothing seems to work. I’ve tried the recommended nvidia-persistenced, and it made no difference.

The primary application I run is a computer vision face detector built on CUDA. We run 6-10 streams with no problem on a GTX 1080 Ti. I cannot even run a single stream on this RTX 2080 without hitting maximum GPU utilization (I know there is something wrong).

I’ve literally re-staged the entire laptop 5 times, with every variation of Ubuntu PPA drivers and drivers directly from NVIDIA (runfile), and nothing has worked.

To add even more mystery, I attached an eGPU with a GTX 1080 Ti, and somehow the RTX 2080 boosted to 80+ W and everything started working fine. I then detached the eGPU and it kept working for many hours; it’s almost as if connecting the Thunderbolt 3 eGPU changed a setting and things started “magically” working. I rebooted the system and it continued to work. While the app was running, I unplugged the laptop from the power supply and the GPU voltage immediately dropped (which was expected). After this I plugged the power back in and rebooted the laptop. Somehow the sluggish behavior returned, and I’ve never seen the RTX 2080 Max-Q go above 40 W again, even though GPU utilization hits 100%. I’ve tried reattaching the eGPU (which works fine, by the way) and nothing seems to help.

Is there any particular factor that might impact GPU performance that I am missing? I’ve ruled out temperature as the issue. If I had not seen our software work for several hours, I’d believe the card is simply bad or the new architecture isn’t compatible, but I saw it work for many hours, which leads me to believe there is a configuration problem or something else limiting the performance of the card. I’ve attached the nvidia-smi -q output below:

root@m15:~# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Mon Jul 15 05:32:36 2019
Driver Version                      : 418.43
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce RTX 2080 with Max-Q Design
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-281023f6-69d6-6d90-09f5-8e3872f46825
    Minor Number                    : 0
    VBIOS Version                   : 90.04.3B.00.8E
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E9010DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x08A11028
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 1000 KB/s
        Rx Throughput               : 2000 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 7952 MiB
        Used                        : 7248 MiB
        Free                        : 704 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 10 MiB
        Free                        : 246 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 100 %
        Memory                      : 4 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 50 C
        GPU Shutdown Temp           : 99 C
        GPU Slowdown Temp           : 94 C
        GPU Max Operating Temp      : 87 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : 40.97 W
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 247 MHz
        SM                          : 247 MHz
        Memory                      : 6000 MHz
        Video                       : 990 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 6001 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : 2100 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 1833
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 380 MiB
        Process ID                  : 2017
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 252 MiB
        Process ID                  : 2498
            Type                    : G
            Name                    : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=2962196500061362648,8468912491728292362,131072 --enable-crash-reporter=6c9e4385-3542-4db8-af98-baa911bfa25a, --gpu-preferences=KAAAAAAAAAAgAAAgAQAAAAAAAAAAAGAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --service-request-channel-token=12929593306769725892
            Used GPU Memory         : 155 MiB
        Process ID                  : 25677
            Type                    : C
            Name                    : ./gpu_burn
            Used GPU Memory         : 6447 MiB

nvidia-bug-report.log.gz (1.09 MB)
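
The pattern in this dump (100 % utilization, SM clock at 247 MHz out of a 2100 MHz maximum, and no throttle reason flagged) differs from the server case at the top of the thread, where the clock is high and “SW Power Cap” is active. A rough way to tell the two apart in a script is sketched below. The `GpuSnapshot` type and the 90 % / 0.5 thresholds are hypothetical choices for illustration, not values from NVIDIA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuSnapshot:
    """A few numbers read off an `nvidia-smi -q` dump."""
    util_pct: float
    sm_mhz: float
    max_sm_mhz: float
    active_reasons: List[str] = field(default_factory=list)


def clock_ratio(s: GpuSnapshot) -> float:
    """Current SM clock as a fraction of the reported maximum."""
    return s.sm_mhz / s.max_sm_mhz


def stuck_at_low_clocks(s: GpuSnapshot, min_util_pct: float = 90.0,
                        max_ratio: float = 0.5) -> bool:
    """True when the GPU is busy, clocked low, and reports no throttle reason.

    Both thresholds are arbitrary illustrative cutoffs.
    """
    return (
        s.util_pct >= min_util_pct
        and clock_ratio(s) <= max_ratio
        and not s.active_reasons
    )


# Values copied from the two dumps in this thread.
MAXQ_LAPTOP = GpuSnapshot(util_pct=100, sm_mhz=247, max_sm_mhz=2100)
SERVER_2080TI = GpuSnapshot(util_pct=69, sm_mhz=1770, max_sm_mhz=2100,
                            active_reasons=["SW Power Cap"])

for name, snap in [("Max-Q laptop", MAXQ_LAPTOP), ("2080 Ti server", SERVER_2080TI)]:
    print(name, round(clock_ratio(snap), 3), stuck_at_low_clocks(snap))
```

With the numbers from this thread, the laptop runs at roughly 12 % of its maximum SM clock and is flagged, while the server card is not.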

marc.sanpedro, this looks like either a BIOS or a driver bug. The GPU does not clock up and stays at the lowest clocks. Please check for a BIOS upgrade, and check whether this still happens with the current 430.34 driver. If that doesn’t help, please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

I’ve updated to the latest BIOS provided by Dell/Alienware. I’ve turned off all BIOS power-limiting features. I’ve previously tried all versions of the driver, including 430.34, with the same result. The nvidia-bug-report log was attached to my previous comment.

Apart from a driver bug, I can only suspect that a Wayland session was started for GDM; at least, I can only see the user Xsession. Are you using GDM, or did you change to a different DM?
If using GDM, please check this: https://askubuntu.com/questions/975094/how-to-disable-wayland-in-17-10-in-gdm3-login-screen

I followed the instructions in the link to make sure, and this had no impact. Do you recommend that I use a different driver? I’ve tried 430.34 and it makes no difference.

Since support for the 2080 Max-Q was only added with driver 418.30, there isn’t really another driver you could use that you haven’t already tried. Please send the nvidia-bug-report.log with a description of the bug to linux-bugs[at]nvidia.com for additional attention.

I’ve sent the bug to linux-bugs[at]nvidia.com. Once there’s a resolution (fingers crossed), I’ll update this thread.

I had the same issue with the Alienware M15 2080 Max-Q.
I just updated to the BIOS that was released on 25 July 2019 and now the card seems to operate correctly.
I installed CUDA 10.1 via the .deb from the NVIDIA site (driver version 418.87.00).

Same issue here with an Alienware M17 and RTX 2070 Max-Q. The GPU runs slow most of the time but sometimes runs at 5x the usual “slow” speed.

It seemed fixed with the latest BIOS update (V. 2.2.1, released on 12/sept/2019) and NVIDIA driver 435.21, but after I unplugged the power and plugged it back in, it went back to running slow.

Rebooting does not help; a workaround seems to be booting into Windows and then back into Ubuntu.
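
Since several reports above tie the slowdown to unplugging and replugging AC power, it may help to log power draw and clocks while reproducing it. The sketch below polls `nvidia-smi --query-gpu` and parses one CSV line per sample. The field list and the function names are my own choices for illustration; on some laptops individual fields may come back as `[N/A]`, which the parser turns into `None`.

```python
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`; the selection is illustrative.
QUERY_FIELDS = ["pstate", "power.draw", "clocks.sm", "clocks.max.sm", "utilization.gpu"]


def build_command(gpu_index=0):
    """Command line that prints one CSV row (no header, no units) for one GPU."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def to_number(cell):
    """Convert a CSV cell to float, or None for non-numeric cells such as '[N/A]'."""
    try:
        return float(cell)
    except ValueError:
        return None


def parse_sample(csv_line):
    """Turn a row like 'P0, 40.97, 247, 2100, 100' into a dict."""
    pstate, power, sm, sm_max, util = [c.strip() for c in csv_line.split(",")]
    return {
        "pstate": pstate,
        "power_w": to_number(power),
        "sm_mhz": to_number(sm),
        "max_sm_mhz": to_number(sm_max),
        "util_pct": to_number(util),
    }


def read_sample(gpu_index=0):
    """Run nvidia-smi once and parse its first output row (requires the NVIDIA driver)."""
    result = subprocess.run(
        build_command(gpu_index), capture_output=True, text=True, check=True
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


def log_samples(count, interval_s=1.0, gpu_index=0):
    """Print `count` timestamped samples, one every `interval_s` seconds."""
    for _ in range(count):
        print(time.strftime("%H:%M:%S"), read_sample(gpu_index))
        time.sleep(interval_s)
```

For example, calling `log_samples(120)` while a CUDA load is running, then unplugging and replugging the adapter, would show at which point the power draw and SM clock change, which could be useful to include in a bug report.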