Hello,
I have a few servers where the Enforced Power Limit is degraded and others where the Enforced Power Limit is at the maximum. Out of about 30 servers, 10 have the Enforced Power Limit set to 350.00 W instead of 400.00 W.
OS: Red Hat Enterprise Linux release 8.3 (Ootpa)
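I am trying to figure out why.
Below are the Power Readings sections from one GPU on an affected server (Enforced Power Limit 350.00 W) and one GPU on a working server (Enforced Power Limit 400.00 W), followed by the nvidia-smi -a output for these 2 servers.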
Power Readings
Power Management : Supported
Power Draw : 83.64 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Power Readings
Power Management : Supported
Power Draw : 78.20 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
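To check all 30 servers the same way, a small script could flag any GPU whose Enforced Power Limit is below its Power Limit. The following is a minimal sketch, assuming the text layout of nvidia-smi -a shown in this post; the function names power_limits and capped_gpus and the sample string are hypothetical, made up for illustration.

```python
import re

# Matches a per-GPU header line such as "GPU 00000000:07:00.0".
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)
# Matches a wattage value such as "400.00 W".
WATTS = re.compile(r"^([0-9]+(?:\.[0-9]+)?)\s*W$")


def power_limits(smi_text):
    """Return {bus_id: {"Power Limit": watts, "Enforced Power Limit": watts}}."""
    limits = {}
    current = None
    for raw in smi_text.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            current = header.group(1)
            limits[current] = {}
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Exact key match, so "Default/Min/Max Power Limit" are ignored.
        if key in ("Power Limit", "Enforced Power Limit"):
            match = WATTS.match(value)
            if match:
                limits[current][key] = float(match.group(1))
    return limits


def capped_gpus(smi_text):
    """List (bus_id, power_limit, enforced_limit) where enforced < power limit."""
    capped = []
    for bus_id, values in power_limits(smi_text).items():
        requested = values.get("Power Limit")
        enforced = values.get("Enforced Power Limit")
        if requested is not None and enforced is not None and enforced < requested:
            capped.append((bus_id, requested, enforced))
    return capped


# Hypothetical sample built from the values in this post.
sample = """\
GPU 00000000:07:00.0
Power Readings
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 350.00 W
"""
print(capped_gpus(sample))
```

On a real host, the text would come from capturing the standard output of nvidia-smi -a instead of the inline sample.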
nvidia-smi -a output for the working server (Enforced Power Limit 400 W)
==============NVSMI LOG==============
Timestamp : Mon Jul 19 15:34:49 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 8
GPU 00000000:07:00.0
Product Name : A100-SXM-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560221014681
GPU UUID : GPU-9dac6767-3c33-d879-fafe-e52c241896f3
Minor Number : 2
VBIOS Version : 92.00.36.00.01
MultiGPU Board : No
Board ID : 0x700
GPU Part Number : 692-2G506-0210-002
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x07
Device : 0x00
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:07:00.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 25 MiB
Free : 131047 MiB
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 33 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 49 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 79.49 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 1593 MHz
Video : 1245 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:0B:00.0
Product Name : A100-SXM-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560421017230
GPU UUID : GPU-d1e0deee-4d00-c73a-a2be-39af80eecc42
Minor Number : 3
VBIOS Version : 92.00.36.00.01
MultiGPU Board : No
Board ID : 0xb00
GPU Part Number : 692-2G506-0210-002
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x0B
Device : 0x00
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:0B:00.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 13 MiB
Free : 131059 MiB
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 32 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 48 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 78.20 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 1593 MHz
Video : 1245 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:48:00.0
Product Name : A100-SXM-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560421016803
GPU UUID : GPU-ede66894-4716-6697-0d5d-4abc69a565c7
Minor Number : 0
VBIOS Version : 92.00.36.00.01
MultiGPU Board : No
Board ID : 0x4800
GPU Part Number : 692-2G506-0210-002
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x48
Device : 0x00
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:48:00.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 21 MiB
Free : 131051 MiB
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 32 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 49 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 82.12 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 1593 MHz
Video : 1245 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
nvidia-smi -a output for the non-working server (Enforced Power Limit 350 W)
==============NVSMI LOG==============
Timestamp : Mon Jul 19 14:51:27 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 8
GPU 00000000:07:00.0
Product Name : A100-SXM-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560421009991
GPU UUID : GPU-8a9b0ba3-4dd6-f6be-99f1-7170f479482a
Minor Number : 2
VBIOS Version : 92.00.36.00.01
MultiGPU Board : No
Board ID : 0x700
GPU Part Number : 692-2G506-0210-002
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x07
Device : 0x00
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:07:00.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 237 MiB
Free : 130835 MiB
Compute Mode : Exclusive_Process
Utilization
Gpu : 59 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 50 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 83.64 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:0B:00.0
Product Name : A100-SXM-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560421010244
GPU UUID : GPU-5fbc191a-1afc-7546-8f0d-905c0fd8551b
Minor Number : 3
VBIOS Version : 92.00.36.00.01
MultiGPU Board : No
Board ID : 0xb00
GPU Part Number : 692-2G506-0210-002
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x0B
Device : 0x00
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:0B:00.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 189 MiB
Free : 130883 MiB
Compute Mode : Exclusive_Process
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 34 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 49 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 83.48 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
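Since the two logs are long, it may be easier to diff them per GPU than to compare them by eye. Below is a minimal sketch, assuming both logs are saved as plain text in the flattened form pasted above (no indentation); the names fields_by_gpu and diff_gpus are hypothetical. Because the indentation is gone, each key is qualified with the most recent line that has no " : " separator, which is only an approximation of the real section nesting.

```python
import re

# Matches a per-GPU header line such as "GPU 00000000:07:00.0".
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def fields_by_gpu(smi_text):
    """Return {bus_id: {"Section/Key": value}} for one nvidia-smi -a log."""
    gpus = {}
    current = None
    section = ""
    for raw in smi_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = GPU_HEADER.match(line)
        if header:
            current = gpus.setdefault(header.group(1), {})
            section = ""
            continue
        if current is None:
            continue  # skip the log preamble (timestamp, driver version, ...)
        if " : " not in line:
            section = line  # treat a line without "key : value" as a section name
            continue
        key, _, value = line.partition(" : ")
        current[f"{section}/{key.strip()}"] = value.strip()
    return gpus


def diff_gpus(good_text, bad_text):
    """Return {bus_id: {field: (good_value, bad_value)}} for fields that differ."""
    good, bad = fields_by_gpu(good_text), fields_by_gpu(bad_text)
    diffs = {}
    for bus_id in sorted(good.keys() & bad.keys()):
        g, b = good[bus_id], bad[bus_id]
        changed = {
            key: (g.get(key), b.get(key))
            for key in sorted(g.keys() | b.keys())
            if g.get(key) != b.get(key)
        }
        if changed:
            diffs[bus_id] = changed
    return diffs


# Hypothetical two-line excerpts standing in for the full logs.
good_log = "GPU 00000000:07:00.0\nPower Readings\nPower Limit : 400.00 W\nEnforced Power Limit : 400.00 W\n"
bad_log = "GPU 00000000:07:00.0\nPower Readings\nPower Limit : 400.00 W\nEnforced Power Limit : 350.00 W\n"
for bus_id, changed in diff_gpus(good_log, bad_log).items():
    for field, (good_value, bad_value) in changed.items():
        print(bus_id, field, good_value, "->", bad_value)
```

Fields that always differ between hosts (serial number, UUID, temperatures, power draw) will show up too, so the interesting part is whatever remains after those are set aside.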