1080Ti stuck at idle clock frequency even at 100% GPU utilization

bastid4tqx · July 23, 2018, 5:06pm

My setup:

GTX 1080 Ti, 390.77, Ubuntu Server 16.04 (headless)

The system is used to train neural networks, mostly with TensorFlow. I recently discovered that even when a training process runs for multiple hours and nvidia-smi reports a GPU utilization of well over 90%, the power consumption (as reported by nvidia-smi) never exceeds about 42 watts.
At first I thought that the power consumption info might not be reliable, but it turned out that another system equipped with a single 1050Ti outperformed the 1080Ti by almost a factor of 2.

For further diagnosis, I’m running a different process (“ethminer”) that reliably uses 100% GPU all the time. The results were exactly the same with other GPU-intensive tasks, however.

For reference, the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   42C    P5    42W / 250W |   2885MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3377      C   ethminer                                    2875MiB |
+-----------------------------------------------------------------------------+

Going through the output of nvidia-smi -q, there were a few things standing out to me:
(The GPU utilization was still at 100% at the time nvidia-smi -q was executed)

Performance State               : P5
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active

"Performance State" is P5

"SW Power Cap" is "Active"

I see no reason why “SW Power Cap” would be active:

Power Readings
        Power Management            : Supported
        Power Draw                  : 42.95 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W

The cause of the bad performance seems to be that all the clocks run at idle speed:

Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 810 MHz
        Video                       : 734 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1936 MHz
        SM                          : 1936 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
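
The readings above can also be checked programmatically. The following is a minimal sketch, not an official tool: it assumes the plain-text layout of nvidia-smi -q shown in this post (section headers without " : ", readings with " : "), and the names parse_smi_sections, mhz, clock_report, and SAMPLE are made up for illustration. It reports each clock as a fraction of its maximum and lists the throttle reasons marked "Active".

```python
import re

# A reading looks like "Graphics    : 139 MHz"; a header has no " : " in it.
_KEY_VALUE = re.compile(r"^(.+?)\s+:\s+(.*)$")


def parse_smi_sections(text):
    """Group 'key : value' lines under the most recent header line."""
    sections = {}
    current = "(top)"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _KEY_VALUE.match(line)
        if match:
            sections.setdefault(current, {})[match.group(1)] = match.group(2)
        else:
            current = line
    return sections


def mhz(value):
    """Return the integer MHz in a reading such as '139 MHz', else None."""
    match = re.match(r"(\d+) MHz", value)
    return int(match.group(1)) if match else None


def clock_report(text):
    """Return ({clock name: current/max ratio}, [active throttle reasons])."""
    sections = parse_smi_sections(text)
    current = sections.get("Clocks", {})
    maximum = sections.get("Max Clocks", {})
    ratios = {}
    for name, value in current.items():
        now, top = mhz(value), mhz(maximum.get(name, ""))
        if now is not None and top:
            ratios[name] = now / top
    reasons = sections.get("Clocks Throttle Reasons", {})
    active = [name for name, state in reasons.items() if state == "Active"]
    return ratios, active


# Excerpt of the nvidia-smi -q output quoted in this post.
SAMPLE = """
    Clocks Throttle Reasons
        Idle                        : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 810 MHz
        Video                       : 734 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1936 MHz
        SM                          : 1936 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
"""

ratios, active = clock_report(SAMPLE)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of max")
print("Active throttle reasons:", active)
```

With the values from this post, the graphics and SM clocks come out at roughly 7% of their maximum while "SW Power Cap" is the only active throttle reason.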

The complete output of nvidia-smi -q is included in the bug report linked below; it starts at line 5011.

Steps I already tried to solve the problem, without success:

rebooted the system

uninstalled the nvidia driver (via the runfile; --uninstall) and installed it again

installed the nvidia driver 396.18 (beta) from http://www.nvidia.com/drivers/results/133571

waited at least 36 hours at 100% GPU utilization

I also tried running an X server in the background and changing parameters in /etc/X11/xorg.conf, but none of the combinations I tried so far had any effect.

When I reboot the system or stop all processes using the gpu, nvidia-smi reports a performance state of P8 and the power consumption is about 20 Watts. “SW Power Cap” however is still active and the clock rate is also 139 MHz.

I looked around the internet and only found one post where someone had a similar situation, but it didn’t get much attention:
https://www.kaggle.com/c/data-science-bowl-2018/discussion/50444

I ran sudo nvidia-bug-report.sh and uploaded the output here: https://pastebin.com/yhXxLFPf

It’d be great if someone has experienced similar problems or at least has an idea what I can do to get the expected performance out of this GPU.

Basti

nvidia-bug-report.log.gz (102 KB)

generix · July 23, 2018, 8:02pm

Things to try:

Update the BIOS
Use a different slot

bastid4tqx · July 25, 2018, 4:16pm

Thanks for your reply.

I updated the BIOS to the latest version and even restored the factory default settings, but the behavior is still the same.
The mainboard I use has only one PCIe x16 slot. There is one additional x8 slot, but I couldn’t fit the GPU in there due to space limitations.

Is there anything else I could try?

generix · July 25, 2018, 5:25pm

See if the persistence daemon is started, if not, start it.
If that doesn’t help, modify the nvidia-persistenced.service file to run as root and use the persistence mode (i.e. remove the options --no-persistence-mode and --user).
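
To see whether persistence mode is actually on after starting the daemon, something like the following could be used. This is only a sketch under assumptions: it relies on the nvidia-smi --query-gpu fields persistence_mode, pstate, clocks.sm, clocks.max.sm, and power.draw, and the helper names and the 25% threshold are made up for illustration. The heuristic is only meaningful while a GPU load is running, since an idle card sits at low clocks anyway.

```python
import subprocess

# Fields passed to nvidia-smi --query-gpu, in the order they are returned.
FIELDS = ["persistence_mode", "pstate", "clocks.sm", "clocks.max.sm", "power.draw"]


def parse_query_line(line, fields=FIELDS):
    """Turn one CSV line (noheader, nounits) into a {field: value} dict."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(fields, values))


def low_clock_without_persistence(row, threshold=0.25):
    """Hypothetical heuristic: persistence mode is off and the SM clock is
    below `threshold` times its maximum. Only meaningful under load."""
    try:
        ratio = float(row["clocks.sm"]) / float(row["clocks.max.sm"])
    except (ValueError, ZeroDivisionError):
        return False  # field not reported as a number on this GPU
    return row["persistence_mode"] == "Disabled" and ratio < threshold


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_query_line(line) for line in result.stdout.splitlines() if line.strip()]
```

On a machine with the driver installed, calling query_gpus() and passing each row to low_clock_without_persistence() while a job is running would flag a card in the state described in this thread (persistence mode "Disabled", SM clock at 139 of 1936 MHz).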

bastid4tqx · July 25, 2018, 7:00pm

Starting the persistence daemon fixed the problem - Thanks so much!

Basti
