Hello ,
I have a GPUnode in a bright cluster HPC is not going over 100W in power consumption, I have compared with other nodes that do not have this issue. When ever a user is submitting a job in this node it is considerably slows down training.
After I compared the nodes , I saw :
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
I am wondering if this is the root cause and I can switch this parameter or this is for monitoring and power distribution or problem is causing this flag to be active.
Find bellow the output of #nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed Jun 14 13:36:11 2023
Driver Version : 510.47.03
CUDA Version : 11.6
Attached GPUs : 4
GPU 00000000:01:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083776
GPU UUID : GPU-5ed5ab98-4097-f26f-9640-4fa66c6d2d32
Minor Number : 1
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 692-2G506-0202-002
Module ID : 3
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 28 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 48.00 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None
GPU 00000000:41:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083988
GPU UUID : GPU-1751199b-e7fd-d36a-3a17-07004d07b073
Minor Number : 0
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x4100
GPU Part Number : 692-2G506-0202-002
Module ID : 2
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 30 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 52.66 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 700.000 mV
Processes : None
GPU 00000000:81:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324420013220
GPU UUID : GPU-7307a669-34c8-3b74-b7cd-07797afe0962
Minor Number : 3
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0x8100
GPU Part Number : 692-2G506-0202-002
Module ID : 1
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x81
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:81:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 25 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 26 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 49.52 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Processes : None
GPU 00000000:C1:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324320083551
GPU UUID : GPU-a814861d-efb7-f636-2f39-e785a014bc0c
Minor Number : 2
VBIOS Version : 92.00.19.00.13
MultiGPU Board : No
Board ID : 0xc100
GPU Part Number : 692-2G506-0202-002
Module ID : 0
Inforom Version
Image Version : G506.0202.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xC1
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:C1:00.0
Sub System Id : 0x144E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 423 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 1
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 1 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 25 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 50.38 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 52 MHz
SM : 52 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 718.750 mV
Processes : None