1080Ti stuck at idle clock frequency even at 100% GPU utilization

My setup:

GTX 1080 Ti, 390.77, Ubuntu Server 16.04 (headless)

The system is used to train neural networks, mostly with TensorFlow. I recently discovered that even when a training process runs for multiple hours and nvidia-smi reports a GPU utilization of well over 90%, the power consumption (as reported by nvidia-smi) never exceeds about 42 watts.
At first I thought that the power consumption info might not be reliable, but it turned out that another system equipped with a single 1050Ti outperformed the 1080Ti by almost a factor of 2.

For further diagnosis, I’m running a different process (“ethminer”) that reliably uses 100% of the GPU all the time. The results were exactly the same with other GPU-intensive tasks, however.

For reference, the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   42C    P5    42W / 250W |   2885MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3377      C   ethminer                                    2875MiB |
+-----------------------------------------------------------------------------+
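
For anyone who wants to watch these numbers over time rather than read the table by hand, here is a minimal sketch, assuming Python 3.7+ and an nvidia-smi that supports the `--query-gpu` CSV output. The function names `query_gpu` and `parse_gpu_csv` and the choice of fields are illustrative, not something from the original post.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the selection is an assumption for this sketch.
FIELDS = ["pstate", "power.draw", "clocks.sm", "clocks.max.sm", "utilization.gpu"]


def query_gpu():
    """Run nvidia-smi and return its raw CSV output (one line per GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_csv(text):
    """Turn the CSV text into a list of dicts keyed by field name."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:  # skip empty lines
            rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows
```

On a machine with the driver installed, `parse_gpu_csv(query_gpu())` should give one dict per GPU. Keeping the parsing separate from the subprocess call means it can be tried on saved output as well.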

Going through the output of nvidia-smi -q, a few things stood out to me:
(The GPU utilization was still at 100% at the time nvidia-smi -q was executed)

Performance State               : P5
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
  • "Performance State" is P5
  • "SW Power Cap" is "Active"
  • I see no reason why “SW Power Cap” would be active:

    Power Readings
            Power Management            : Supported
            Power Draw                  : 42.95 W
            Power Limit                 : 250.00 W
            Default Power Limit         : 250.00 W
            Enforced Power Limit        : 250.00 W
            Min Power Limit             : 125.00 W
            Max Power Limit             : 300.00 W
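
The throttle reasons can also be pulled out of the `nvidia-smi -q` text programmatically. The following is a rough sketch that assumes the section layout shown in the excerpt above (a "Clocks Throttle Reasons" header followed by "name : value" lines); `active_throttle_reasons` is a hypothetical helper name.

```python
def active_throttle_reasons(report):
    """Return the names of throttle reasons marked 'Active' in nvidia-smi -q text."""
    reasons = []
    in_section = False
    for line in report.splitlines():
        stripped = line.strip()
        if not in_section:
            # Wait until the section header appears.
            in_section = stripped.startswith("Clocks Throttle Reasons")
            continue
        if ":" not in stripped:
            if stripped:
                break  # a non-empty line without a colon starts the next section
            continue
        name, _, value = stripped.partition(":")
        if value.strip() == "Active":  # "Not Active" does not match
            reasons.append(name.strip())
    return reasons
```

Fed the excerpt from this post, it reports only "SW Power Cap", which matches what was observed by eye.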
    

    The cause for the bad performance seems to be that the clocks all run at idle speed:

    Clocks
            Graphics                    : 139 MHz
            SM                          : 139 MHz
            Memory                      : 810 MHz
            Video                       : 734 MHz
        Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Default Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Max Clocks
            Graphics                    : 1936 MHz
            SM                          : 1936 MHz
            Memory                      : 5505 MHz
            Video                       : 1620 MHz
        Max Customer Boost Clocks
            Graphics                    : N/A
        Clock Policy
            Auto Boost                  : N/A
            Auto Boost Default          : N/A
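
The symptom described here (high utilization while the SM clock sits far below its maximum) can be expressed as a simple check. This is only a sketch: the 90% utilization threshold and the 50% clock fraction are arbitrary assumptions, and `looks_stuck_at_idle` is a made-up name.

```python
def looks_stuck_at_idle(util_percent, sm_mhz, max_sm_mhz,
                        util_threshold=90.0, clock_fraction=0.5):
    """Heuristic: GPU is busy, yet the SM clock is far below its maximum.

    The thresholds are illustrative defaults, not values from NVIDIA.
    """
    if max_sm_mhz <= 0:
        raise ValueError("max_sm_mhz must be positive")
    busy = util_percent >= util_threshold
    slow = sm_mhz / max_sm_mhz < clock_fraction
    return busy and slow
```

With the values from this post (100% utilization, 139 MHz out of 1936 MHz) the check returns True. A card that is simply idle, or one that is busy and boosting normally, would return False.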
    

    The complete output of nvidia-smi -q is included in the bug report linked below; it starts at line 5011.

    Steps I already tried to solve the problem, without success:

  • Rebooted the system
  • Uninstalled the NVIDIA driver (via the runfile; --uninstall) and installed it again
  • Installed the NVIDIA driver 396.18 (beta) from http://www.nvidia.com/drivers/results/133571
  • Waited: at least 36 hours at 100% GPU utilization
  • I also tried running an X server in the background and changing parameters in /etc/X11/xorg.conf, but none of the combinations I tried so far had any effect.

    When I reboot the system or stop all processes using the gpu, nvidia-smi reports a performance state of P8 and the power consumption is about 20 Watts. “SW Power Cap” however is still active and the clock rate is also 139 MHz.

    I looked around the internet and only found one post where someone had a similar situation, but it didn’t get much attention:
    https://www.kaggle.com/c/data-science-bowl-2018/discussion/50444

    I ran sudo nvidia-bug-report.sh and uploaded the output here: https://pastebin.com/yhXxLFPf

    It’d be great if someone has experienced similar problems or at least has an idea what I can do to get the expected performance out of this GPU.

    Basti

    nvidia-bug-report.log.gz (102 KB)

    Things to try:

    • Update the BIOS
    • Use a different PCIe slot

    Thanks for your reply.

    I updated the BIOS to the latest version and even restored the factory default settings, but the behavior is still the same.
    The mainboard I use has only one PCIe x16 slot. There is one additional x8 slot, but I couldn’t fit the GPU in there due to space limitations.

    Is there anything else I could try?

    Check whether the persistence daemon (nvidia-persistenced) is running; if not, start it.
    If that doesn’t help, modify the nvidia-persistenced.service file so that the daemon runs as root and uses persistence mode (i.e. remove the options --no-persistence-mode and --user).
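
The service-file change suggested above amounts to dropping two options from the daemon's command line. Here is a small sketch of that edit, assuming the unit file has an `ExecStart=` line. The thread does not show the actual file, so the example line in the comment is hypothetical, and `enable_persistence_mode` is an illustrative helper name.

```python
import shlex


def enable_persistence_mode(exec_start):
    """Drop --no-persistence-mode and --user <name> from an ExecStart= line."""
    prefix, sep, command = exec_start.partition("=")
    if not sep or prefix.strip() != "ExecStart":
        raise ValueError("expected an ExecStart= line")
    kept = []
    tokens = iter(shlex.split(command))
    for token in tokens:
        if token == "--no-persistence-mode":
            continue
        if token == "--user":
            next(tokens, None)  # also skip the user name that follows
            continue
        if token.startswith("--user="):
            continue
        kept.append(token)
    return "ExecStart=" + " ".join(shlex.quote(t) for t in kept)


# Hypothetical input line (not taken from the thread):
# ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
```

After changing a unit file by hand, systemd normally needs `systemctl daemon-reload` followed by a restart of the service before the new command line takes effect.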

    Starting the persistence daemon fixed the problem. Thanks so much!

    Basti