mlx5_pcie_event: Detected insufficient power on the PCIe slot (27W)

After I restarted the OFED driver (sudo /etc/init.d/openibd restart), the kernel log showed the following:

mlx5_pcie_event:301:(pid 21676): Detected insufficient power on the PCIe slot (27W).
mlx5_core 0000:42:00.0: poll_health:971:(pid 0): device's health compromised - reached miss count
mlx5_core 0000:42:00.0: print_health_info:492:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[0] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[1] 0xbadc0ffe
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[2] 0x00000001
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[3] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[4] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[5] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:498:(pid 0): assert_exit_ptr 0x2088e898
mlx5_core 0000:42:00.0: print_health_info:499:(pid 0): assert_callra 0x20890804
mlx5_core 0000:42:00.0: print_health_info:501:(pid 0): fw_ver 22.30.1004
mlx5_core 0000:42:00.0: print_health_info:502:(pid 0): time 0
mlx5_core 0000:42:00.0: print_health_info:503:(pid 0): hw_id 0x00000212
mlx5_core 0000:42:00.0: print_health_info:504:(pid 0): rfr 0
mlx5_core 0000:42:00.0: print_health_info:505:(pid 0): severity 3 (ERROR)
mlx5_core 0000:42:00.0: print_health_info:506:(pid 0): irisc_index 7
mlx5_core 0000:42:00.0: print_health_info:508:(pid 0): synd 0x1: firmware internal error
mlx5_core 0000:42:00.0: print_health_info:509:(pid 0): ext_synd 0x8a47
mlx5_core 0000:42:00.0: print_health_info:510:(pid 0): raw fw_ver 0x161e03ec
mlx5_core 0000:42:00.1: poll_health:971:(pid 0): device's health compromised - reached miss count
mlx5_core 0000:42:00.1: print_health_info:492:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[0] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[1] 0xbadc0ffe
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[2] 0x00000001
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[3] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[4] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[5] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:498:(pid 0): assert_exit_ptr 0x2088e898
mlx5_core 0000:42:00.1: print_health_info:499:(pid 0): assert_callra 0x20890804
mlx5_core 0000:42:00.1: print_health_info:501:(pid 0): fw_ver 22.30.1004
mlx5_core 0000:42:00.1: print_health_info:502:(pid 0): time 0
mlx5_core 0000:42:00.1: print_health_info:503:(pid 0): hw_id 0x00000212
mlx5_core 0000:42:00.1: print_health_info:504:(pid 0): rfr 0
mlx5_core 0000:42:00.1: print_health_info:505:(pid 0): severity 3 (ERROR)
mlx5_core 0000:42:00.1: print_health_info:506:(pid 0): irisc_index 7
mlx5_core 0000:42:00.1: print_health_info:508:(pid 0): synd 0x1: firmware internal error
mlx5_core 0000:42:00.1: print_health_info:509:(pid 0): ext_synd 0x8a47
mlx5_core 0000:42:00.1: print_health_info:510:(pid 0): raw fw_ver 0x161e03ec
mlx5_core 0000:42:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
mlx5_core 0000:42:00.1: mlx5_wait_for_pages:898:(pid 18721): Skipping wait for vf pages stage
mlx5_core 0000:42:00.1: E-Switch: cleanup

Why does the error "Detected insufficient power on the PCIe slot" occur? I can currently run RDMA user programs normally, but does this message mean that the network card hardware is damaged?

This event usually means that the PCIe slot hosting the ConnectX card doesn’t provide enough power.
Review the server’s motherboard user manual and check the PCIe power limit for each slot.

The card requires 27W to operate in all modes (including with optical modules, which may consume more power than passive modules). As you can see, though, the message doesn’t necessarily mean the card won’t be operational.
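To compare the reported requirement with the limit listed in the motherboard manual, a small script along these lines could help. It is only a sketch: it assumes the event text matches the log line quoted in the question, the function names are made up for illustration, and the slot limit is a value you look up yourself in the manual (the 25 below is a hypothetical figure, not a real slot rating).

```python
import re

# Pattern for the mlx5 power event, taken from the kernel log quoted in the question.
POWER_EVENT = re.compile(
    r"Detected insufficient power on the PCIe slot \((\d+)W\)"
)


def required_watts(log_text):
    """Return every wattage value reported by mlx5 power events in log_text."""
    return [int(watts) for watts in POWER_EVENT.findall(log_text)]


def slot_is_sufficient(log_text, slot_limit_watts):
    """Check a slot limit (from the motherboard manual) against the largest
    requirement reported in the log. Returns True when no event was logged."""
    needed = required_watts(log_text)
    if not needed:
        return True
    return slot_limit_watts >= max(needed)


if __name__ == "__main__":
    # Line copied from the kernel log above.
    line = (
        "mlx5_pcie_event:301:(pid 21676): "
        "Detected insufficient power on the PCIe slot (27W)."
    )
    # 25 is a hypothetical slot limit used only for illustration.
    print(required_watts(line))
    print(slot_is_sufficient(line, 25))
```

In practice you would feed it the saved kernel log text (for example, the output of dmesg written to a file) rather than a single line.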

More information is available here:
https://docs.nvidia.com/networking/display/ConnectX6DxEN/Troubleshooting
