Power consumption of GPUs does not go above 100W - nvidia-smi

Hello,

I have a GPU node in a Bright Cluster HPC environment whose GPUs do not go above 100 W in power consumption. I have compared it with other nodes that do not have this issue. Whenever a user submits a job on this node, training slows down considerably.

After comparing the nodes, I saw:

HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active

I am wondering whether this flag is itself the root cause and can be switched off, or whether it only reports status and a power distribution problem (or some other problem) is causing it to be active.

Find below the output of nvidia-smi -q:

==============NVSMI LOG==============

Timestamp : Wed Jun 14 13:36:11 2023
Driver Version : 510.47.03
CUDA Version : 11.6

Attached GPUs : 4
GPU 00000000:01:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083776
GPU UUID : GPU-5ed5ab98-4097-f26f-9640-4fa66c6d2d32
Minor Number : 1
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 692-2G506-0202-002
Module ID : 3
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 28 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 48.00 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None

GPU 00000000:41:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083988
GPU UUID : GPU-1751199b-e7fd-d36a-3a17-07004d07b073
Minor Number : 0
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x4100
GPU Part Number : 692-2G506-0202-002
Module ID : 2
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 30 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 52.66 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 700.000 mV
Processes : None

GPU 00000000:81:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324420013220
GPU UUID : GPU-7307a669-34c8-3b74-b7cd-07797afe0962
Minor Number : 3
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x8100
GPU Part Number : 692-2G506-0202-002
Module ID : 1
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x81
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:81:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 25 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 26 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 49.52 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Processes : None

GPU 00000000:C1:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083551
GPU UUID : GPU-a814861d-efb7-f636-2f39-e785a014bc0c
Minor Number : 2
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0xc100
GPU Part Number : 692-2G506-0202-002
Module ID : 0
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xC1
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:C1:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 1
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 1 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 25 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 50.38 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 718.750 mV
Processes : None

"HW Power Brake Slowdown : Active" means the server is signalling to the GPU that it must throttle itself. The GPU responds by reducing its clocks, which limits its power consumption.
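To pick the active throttle flags out of a long nvidia-smi -q dump like the one posted, something along these lines may help. It is only a sketch: the function name `active_throttle_reasons` is made up for illustration, and it assumes the text layout shown in this thread (a "GPU <bus id>" header per device, followed by a "Clocks Throttle Reasons" section of "name : value" lines).

```python
import re
from typing import Dict, List

# Matches per-GPU header lines such as "GPU 00000000:01:00.0".
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def active_throttle_reasons(report: str) -> Dict[str, List[str]]:
    """Map each GPU bus id to the throttle reasons reported as Active.

    Hypothetical helper; assumes the `nvidia-smi -q` layout seen in this thread.
    """
    result: Dict[str, List[str]] = {}
    current = None
    in_section = False
    for raw in report.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            current = header.group(1)
            result[current] = []
            in_section = False
            continue
        if current is None:
            continue  # still in the preamble before the first GPU
        if line.startswith("Clocks Throttle Reasons"):
            in_section = True
            continue
        if in_section:
            if ":" not in line:
                # Next section heading (e.g. "FB Memory Usage") ends the block.
                in_section = False
                continue
            name, _, value = line.partition(":")
            if value.strip() == "Active":
                result[current].append(name.strip())
    return result
```

Fed the output posted above, this should list "HW Slowdown" and "HW Power Brake Slowdown" for each of the four GPUs, which makes it easier to compare against a healthy node.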

This is a function of the server design, and concerns about it should be raised with your server vendor. It’s not under the control of NVIDIA and it can’t be resolved or sorted out using any software method or anything that is controllable by the user. There isn’t anything that can be done via this forum to further address the issue.
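When raising this with the server vendor, it may be useful to attach a compact record of power draw and the power brake flag per GPU. The following is a rough sketch, not a fix: it only reads state. The helper names (`query`, `parse_rows`, `braked_gpus`) are hypothetical, and it assumes the installed nvidia-smi accepts the `clocks_throttle_reasons.hw_power_brake_slowdown` query field (check `nvidia-smi --help-query-gpu` on your driver).

```python
import csv
import io
import subprocess
from typing import Dict, List

# Query fields; the last one is assumed to be available on the installed driver.
FIELDS = [
    "index",
    "power.draw",
    "power.limit",
    "clocks_throttle_reasons.hw_power_brake_slowdown",
]


def query() -> str:
    """Run nvidia-smi and return one CSV line per GPU (no header, no units)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_rows(csv_text: str) -> List[Dict[str, str]]:
    """Turn the CSV text into one dict per GPU, keyed by field name."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in reader if row]


def braked_gpus(rows: List[Dict[str, str]]) -> List[str]:
    """Return the indices of GPUs whose power brake flag is Active."""
    return [row["index"] for row in rows if row[FIELDS[3]] == "Active"]
```

On the affected node, `braked_gpus(parse_rows(query()))` would be expected to return all four GPU indices while the server keeps asserting the signal. Running the same call on a healthy node gives the vendor a direct comparison.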

Thank you for the clarification.