After I restarted the OFED driver with the command (sudo /etc/init.d/openibd restart), the kernel log showed the following:
mlx5_pcie_event:301:(pid 21676): Detected insufficient power on the PCIe slot (27W).
mlx5_core 0000:42:00.0: poll_health:971:(pid 0): device's health compromised - reached miss count
mlx5_core 0000:42:00.0: print_health_info:492:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[0] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[1] 0xbadc0ffe
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[2] 0x00000001
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[3] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[4] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:496:(pid 0): assert_var[5] 0x00000000
mlx5_core 0000:42:00.0: print_health_info:498:(pid 0): assert_exit_ptr 0x2088e898
mlx5_core 0000:42:00.0: print_health_info:499:(pid 0): assert_callra 0x20890804
mlx5_core 0000:42:00.0: print_health_info:501:(pid 0): fw_ver 22.30.1004
mlx5_core 0000:42:00.0: print_health_info:502:(pid 0): time 0
mlx5_core 0000:42:00.0: print_health_info:503:(pid 0): hw_id 0x00000212
mlx5_core 0000:42:00.0: print_health_info:504:(pid 0): rfr 0
mlx5_core 0000:42:00.0: print_health_info:505:(pid 0): severity 3 (ERROR)
mlx5_core 0000:42:00.0: print_health_info:506:(pid 0): irisc_index 7
mlx5_core 0000:42:00.0: print_health_info:508:(pid 0): synd 0x1: firmware internal error
mlx5_core 0000:42:00.0: print_health_info:509:(pid 0): ext_synd 0x8a47
mlx5_core 0000:42:00.0: print_health_info:510:(pid 0): raw fw_ver 0x161e03ec
mlx5_core 0000:42:00.1: poll_health:971:(pid 0): device's health compromised - reached miss count
mlx5_core 0000:42:00.1: print_health_info:492:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[0] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[1] 0xbadc0ffe
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[2] 0x00000001
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[3] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[4] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:496:(pid 0): assert_var[5] 0x00000000
mlx5_core 0000:42:00.1: print_health_info:498:(pid 0): assert_exit_ptr 0x2088e898
mlx5_core 0000:42:00.1: print_health_info:499:(pid 0): assert_callra 0x20890804
mlx5_core 0000:42:00.1: print_health_info:501:(pid 0): fw_ver 22.30.1004
mlx5_core 0000:42:00.1: print_health_info:502:(pid 0): time 0
mlx5_core 0000:42:00.1: print_health_info:503:(pid 0): hw_id 0x00000212
mlx5_core 0000:42:00.1: print_health_info:504:(pid 0): rfr 0
mlx5_core 0000:42:00.1: print_health_info:505:(pid 0): severity 3 (ERROR)
mlx5_core 0000:42:00.1: print_health_info:506:(pid 0): irisc_index 7
mlx5_core 0000:42:00.1: print_health_info:508:(pid 0): synd 0x1: firmware internal error
mlx5_core 0000:42:00.1: print_health_info:509:(pid 0): ext_synd 0x8a47
mlx5_core 0000:42:00.1: print_health_info:510:(pid 0): raw fw_ver 0x161e03ec
mlx5_core 0000:42:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
mlx5_core 0000:42:00.1: mlx5_wait_for_pages:898:(pid 18721): Skipping wait for vf pages stage
mlx5_core 0000:42:00.1: E-Switch: cleanup
Why does the error "Detected insufficient power on the PCIe slot" occur? I can currently run RDMA user programs normally, but does this error message indicate that the network card hardware is damaged?
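
A side note for anyone reading the log: the "raw fw_ver 0x161e03ec" line looks like a packed form of the "fw_ver 22.30.1004" line. The sketch below decodes it assuming a layout of 8 bits major, 8 bits minor, and 16 bits sub-minor. That layout is my inference from the values in this log, not something taken from the driver documentation, and the function name decode_raw_fw_ver is only illustrative.

```python
def decode_raw_fw_ver(raw: int) -> str:
    """Split a 32-bit raw firmware version into major.minor.subminor.

    Assumed layout (inferred from the log above, not from documentation):
    bits 31-24 = major, bits 23-16 = minor, bits 15-0 = sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    # Sub-minor is zero-padded to four digits, as in the fw_ver log line.
    return f"{major}.{minor}.{subminor:04d}"


# Value taken from the "raw fw_ver" line in the kernel log.
print(decode_raw_fw_ver(0x161E03EC))  # prints 22.30.1004
```

The decoded value matches the fw_ver line, so both ports report the same firmware version, 22.30.1004.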