My setup:
GeForce GTX 1080 Ti, driver 390.77, Ubuntu Server 16.04 (headless)
The system is used to train neural networks, mostly with TensorFlow. I recently discovered that even when a training process has been running for multiple hours and nvidia-smi reports a GPU utilization of well over 90%, the power consumption (as reported by nvidia-smi) never exceeds about 42 W.
At first I thought the power consumption info might not be reliable, but it turned out that another system equipped with a single 1050 Ti outperformed the 1080 Ti by almost a factor of 2.
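A one-liner along these lines reproduces the readings I'm describing (the query field names are taken from nvidia-smi --help-query-gpu, so they may vary between driver versions):

    # log utilization, power draw, SM clock and P-state once per second
    nvidia-smi --query-gpu=timestamp,utilization.gpu,power.draw,clocks.sm,pstate \
               --format=csv -l 1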
For further diagnosis, I’m running a different process (“ethminer”) that reliably uses 100% of the GPU all the time. Results were exactly the same with other GPU-intensive tasks, however.
For reference, the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   42C    P5    42W / 250W |   2885MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3377      C   ethminer                                    2875MiB |
+-----------------------------------------------------------------------------+
Going through the output of nvidia-smi -q, a few things stood out to me (GPU utilization was still at 100% when nvidia-smi -q was executed):
    Performance State                   : P5
    Clocks Throttle Reasons
        Idle                            : Not Active
        Applications Clocks Setting     : Not Active
        SW Power Cap                    : Active
        HW Slowdown                     : Not Active
        HW Thermal Slowdown             : Not Active
        HW Power Brake Slowdown         : Not Active
        Sync Boost                      : Not Active
        SW Thermal Slowdown             : Not Active
        Display Clock Setting           : Not Active
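In case it helps with reproducing: the same flags can be polled directly with nvidia-smi -q -d PERFORMANCE, or, assuming the query field names listed by --help-query-gpu are available on this driver, with something like:

    # poll the P-state together with the two throttle reasons in question
    nvidia-smi --query-gpu=pstate,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown \
               --format=csv -l 1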
I see no reason why “SW Power Cap” would be active:
    Power Readings
        Power Management                : Supported
        Power Draw                      : 42.95 W
        Power Limit                     : 250.00 W
        Default Power Limit             : 250.00 W
        Enforced Power Limit            : 250.00 W
        Min Power Limit                 : 125.00 W
        Max Power Limit                 : 300.00 W
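Since the enforced limit already equals the default, I don't see what a software power cap at ~42 W could be enforcing. For completeness, these are the knobs I understand should control the limit (just a sketch; I'm not claiming they change anything here, and they may behave differently on GeForce boards):

    sudo nvidia-smi -pm 1      # enable persistence mode
    sudo nvidia-smi -pl 250    # (re-)apply the default 250 W power limit
    nvidia-smi -q -d POWER     # re-read the Power Readings section afterwards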
The cause of the bad performance seems to be that all the clocks run at idle speed:
    Clocks
        Graphics                        : 139 MHz
        SM                              : 139 MHz
        Memory                          : 810 MHz
        Video                           : 734 MHz
    Applications Clocks
        Graphics                        : N/A
        Memory                          : N/A
    Default Applications Clocks
        Graphics                        : N/A
        Memory                          : N/A
    Max Clocks
        Graphics                        : 1936 MHz
        SM                              : 1936 MHz
        Memory                          : 5505 MHz
        Video                           : 1620 MHz
    Max Customer Boost Clocks
        Graphics                        : N/A
    Clock Policy
        Auto Boost                      : N/A
        Auto Boost Default              : N/A
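Since Applications Clocks are reported as N/A, I assume the usual route of pinning the clocks via application clocks is not available on this board. For reference, the command I mean is something like the following (the values are just the Max Clocks from above; it should only work if nvidia-smi -q -d SUPPORTED_CLOCKS lists supported combinations):

    # <memory MHz>,<graphics MHz>; presumably rejected while application clocks show N/A
    sudo nvidia-smi -ac 5505,1936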
The complete output of nvidia-smi -q is included in the bug report linked below; it starts at line 5011.
Steps I already tried to solve the problem, without success:
I also tried running an X server in the background and changing parameters in /etc/X11/xorg.conf, but none of the combinations I have tried so far had any effect.
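To give an idea of what I mean by changing parameters, the fragments were along the lines of the following (the Coolbits value is just an example, and I'm not sure it is even the right knob for a headless setup):

    Section "Device"
        Identifier "nvidia"
        Driver     "nvidia"
        Option     "Coolbits" "28"                           # example value
        Option     "AllowEmptyInitialConfiguration" "true"   # since the box runs headless
    EndSection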
When I reboot the system or stop all processes using the GPU, nvidia-smi reports a performance state of P8 and the power consumption is about 20 W. “SW Power Cap”, however, is still active, and the clock rate is still 139 MHz.
Searching around the internet, I found only one post where someone had a similar situation, but it didn’t get much attention:
https://www.kaggle.com/c/data-science-bowl-2018/discussion/50444
I ran sudo nvidia-bug-report.sh and uploaded the output here: https://pastebin.com/yhXxLFPf
It’d be great to hear from someone who has experienced similar problems, or who at least has an idea of what I can do to get the expected performance out of this GPU.
Basti
nvidia-bug-report.log.gz (102 KB)