HW Power Brake Slowdown

Hello!
We have a problem with our Tesla V100s: something appears to be limiting the power of the GPUs and slowing them down.
For example, when we run a program on a GPU, “GPU-Util” (as reported by nvidia-smi) reaches 100%, but “Pwr:Usage/Cap” (also from nvidia-smi) always stays below 100 W / 250 W. We then found that “HW Power Brake Slowdown” (nvidia-smi -q) is active and the “Graphics Clock” is only about 300 MHz.
How can we fix this problem and get our V100s back to full speed?
Hardware info: HPE DL388 Gen9 with Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz, 128 GB memory, 2x V100
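
For reference, the same values can also be read programmatically through NVML. Below is a minimal sketch, assuming the pynvml (nvidia-ml-py) Python bindings are installed; the throttle-reason constant names follow pynvml and may differ slightly between versions.

import pynvml

# Throttle-reason bits of interest (names as exposed by pynvml).
REASONS = {
    "SW Power Cap":            pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "HW Slowdown":             pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "HW Thermal Slowdown":     pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
    "HW Power Brake Slowdown": pynvml.nvmlClocksThrottleReasonHwPowerBrakeSlowdown,
    "SW Thermal Slowdown":     pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        draw_w  = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0          # NVML reports mW
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0
        sm_mhz  = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        mask    = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
        active  = [name for name, bit in REASONS.items() if mask & bit]
        print(f"GPU {i}: {draw_w:.1f} W / {limit_w:.0f} W, SM {sm_mhz} MHz, "
              f"throttle reasons: {', '.join(active) if active else 'none'}")
finally:
    pynvml.nvmlShutdown()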

(1) Did you purchase this server from HPE with two Tesla V100 already installed?
(2) Can you post the complete output of nvidia-smi -q while you are running a CUDA-accelerated app?
(3) The default power supply for this server seems to be 500W. What is the wattage of the power supply you have actually installed?
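
If a time series is easier to capture than a single nvidia-smi -q dump, a rough pynvml sketch like the following (same pynvml assumption as the sketch above; sampling interval and duration are arbitrary) can log power draw, SM clock, and the power-brake flag once per second while your workload runs:

import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    print("time,gpu,power_w,sm_mhz,hw_power_brake")
    for _ in range(60):                                   # ~1 minute of samples
        stamp = time.strftime("%H:%M:%S")
        for i, h in enumerate(handles):
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
            sm_mhz  = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
            mask    = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
            brake   = bool(mask & pynvml.nvmlClocksThrottleReasonHwPowerBrakeSlowdown)
            print(f"{stamp},{i},{power_w:.1f},{sm_mhz},{brake}")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()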

(1) No

(3) 1400W

(2)
==============NVSMI LOG==============

Timestamp : Tue Apr 2 18:11:25 2019
Driver Version : 410.104
CUDA Version : 10.0

Attached GPUs : 2
GPU 00000000:08:00.0
Product Name : Tesla V100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321218057205
GPU UUID : GPU-4d45ba1b-fdf9-54c5-d26f-e6e529c2b8ec
Minor Number : 0
VBIOS Version : 88.00.01.2D.03
MultiGPU Board : No
Board ID : 0x800
GPU Part Number : 900-2G500-2700-300
Inforom Version
Image Version : G500.0200.00.03
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x08
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB410DE
Bus Id : 00000000:08:00.0
Sub System Id : 0x121410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 174000 KB/s
Rx Throughput : 1178000 KB/s
Fan Speed : 100 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16130 MiB
Used : 5962 MiB
Free : 10168 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 11 MiB
Free : 245 MiB
Compute Mode : Default
Utilization
Gpu : 97 %
Memory : 14 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 49 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 103.59 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 50.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 360 MHz
SM : 360 MHz
Memory : 877 MHz
Video : 1290 MHz
Applications Clocks
Graphics : 1028 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1028 MHz
Memory : 877 MHz
Max Clocks
Graphics : 1455 MHz
SM : 1455 MHz
Memory : 877 MHz
Video : 1312 MHz
Max Customer Boost Clocks
Graphics : 1455 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 3063
Type : C
Name : python
Used GPU Memory : 4681 MiB
Process ID : 12954
Type : C
Name : python
Used GPU Memory : 1269 MiB

GPU 00000000:84:00.0
Product Name : Tesla V100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0333245435243
GPU UUID : GPU-88be8e63-dd61-3d8f-2f44-e1f93204654d
Minor Number : 1
VBIOS Version : 88.00.01.2D.03
MultiGPU Board : No
Board ID : 0x8400
GPU Part Number : 900-2G500-2700-300
Inforom Version
Image Version : G500.0200.00.03
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB410DE
Bus Id : 00000000:84:00.0
Sub System Id : 0x121410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 200000 KB/s
Rx Throughput : 1174000 KB/s
Fan Speed : 96 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16130 MiB
Used : 10623 MiB
Free : 5507 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 7 MiB
Free : 249 MiB
Compute Mode : Default
Utilization
Gpu : 59 %
Memory : 6 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 38 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 51.60 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 50.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 356 MHz
SM : 356 MHz
Memory : 877 MHz
Video : 1282 MHz
Applications Clocks
Graphics : 1028 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1028 MHz
Memory : 877 MHz
Max Clocks
Graphics : 1455 MHz
SM : 1455 MHz
Memory : 877 MHz
Video : 1312 MHz
Max Customer Boost Clocks
Graphics : 1455 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 8743
Type : C
Name : python3
Used GPU Memory : 10565 MiB

Did you install the two Tesla V100 into this system yourself?
If so, did you follow all steps described in the HPE ProLiant DL388 Gen9 Server User Guide?

According to HPE documentation, the default power supply for the ProLiant DL388 Gen9 seems to be a 500W 80 PLUS Platinum compliant PSU. This is not nearly enough to supply the server with two Tesla V100 GPUs.
What PSU are you actually using in this system (wattage, other specifications)?

My working hypothesis is that the electrical power supply to the GPUs is insufficient, causing either the system to assert the power brake or the GPU to apply it itself. For proper operation you need a sufficiently sized power supply (I would suggest >= 1000W for your setup), and you need to make sure all the necessary power cables are plugged into each GPU.
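
For a rough sense of scale, here is a back-of-the-envelope budget. The component figures are assumptions (250 W per V100 matches the power limit in the log above, 85 W is the published TDP of the E5-2620 v4; the other allowances and the dual-CPU configuration are guesses), not HPE sizing guidance.

gpu_w    = 2 * 250    # Tesla V100 PCIe board power, per the 250 W limit in the log
cpu_w    = 2 * 85     # Xeon E5-2620 v4 TDP; assumes a dual-socket configuration
other_w  = 150        # assumed allowance for DRAM, drives, fans, motherboard
peak_w   = gpu_w + cpu_w + other_w
headroom = 1.25       # assumed margin so the PSU does not run at its limit
print(f"estimated peak draw: {peak_w} W, suggested PSU size: >= {peak_w * headroom:.0f} W")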

Note that Tesla GPUs are supposed to be sold already integrated into machines by system vendors that partner with NVIDIA; they are not designed as end-user installable items.

Thanks for your reply!

First, I checked the power supply for this server: it is 1400W.

Second, I did not install the two Tesla V100s into this system myself. When the server was delivered to our company, the two V100s had not yet been installed; I watched the seller’s engineer install them.

Thanks again for your help.

I would take this issue up with the seller or HPE. The system is misconfigured. This is not a power supply issue per se. The power brake is a separate input to the GPU, and is mostly unique to HPE servers. I don’t intend to sort this out over the web.

The seller should not have sold you the system this way.

If you don’t have any luck with the seller, contact HPE support directly, discuss the issue with them, and ask them for a support ticket number. If they are not able to resolve the issue for you, contact me using the private messaging service on this board/forum, and supply me with the HPE ticket number. I will be asking for quite a few other details as well.

If you don’t have the HPE support ticket number, I won’t be able to help you.