Hi,
I updated my ConnectX-4 Lx to the latest available firmware (14.32.1900), and now when I boot I get firmware health errors for both ports in dmesg. What do these messages mean? I can no longer use VPP, DPDK, or XDP: using any of them crashes the system.
# mlxfwmanager --query
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX4LX
Part Number: MCX4121A-ACA_Ax
Description: ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; ROHS R6
PSID: MT_2420110034
PCI Device Name: 0000:01:00.0
Base MAC: 506b4b297f7c
Versions:         Current        Available
   FW             14.32.1900     14.32.1900
   PXE            3.6.0502       3.6.0502
   UEFI           14.25.0017     14.25.0017
Status: Up to date
$ dmesg
...
[ 9.960131] mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[ 9.960151] mlx5_core 0000:01:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 9.960158] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 9.960163] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[1] 0xbadc0ffe
[ 9.960169] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 9.960174] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 9.960179] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 9.960184] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 9.960189] mlx5_core 0000:01:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x00874eac
[ 9.960195] mlx5_core 0000:01:00.0: print_health_info:505:(pid 0): assert_callra 0x00876e08
[ 9.960203] mlx5_core 0000:01:00.0: print_health_info:506:(pid 0): fw_ver 14.32.1900
[ 9.960209] mlx5_core 0000:01:00.0: print_health_info:508:(pid 0): time 1745607553
[ 9.960214] mlx5_core 0000:01:00.0: print_health_info:509:(pid 0): hw_id 0x0000020b
[ 9.960217] mlx5_core 0000:01:00.0: print_health_info:510:(pid 0): rfr 0
[ 9.960219] mlx5_core 0000:01:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 9.960225] mlx5_core 0000:01:00.0: print_health_info:512:(pid 0): irisc_index 2
[ 9.960232] mlx5_core 0000:01:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 9.960237] mlx5_core 0000:01:00.0: print_health_info:515:(pid 0): ext_synd 0x8a47
[ 9.960243] mlx5_core 0000:01:00.0: print_health_info:516:(pid 0): raw fw_ver 0xe020076c
[ 10.344098] mlx5_core 0000:01:00.1: poll_health:1082:(pid 0): device's health compromised - reached miss count
[ 10.344117] mlx5_core 0000:01:00.1: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 10.344126] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 10.344132] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[1] 0xbadc0ffe
[ 10.344138] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 10.344144] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 10.344150] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 10.344155] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 10.344160] mlx5_core 0000:01:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x00874eac
[ 10.344166] mlx5_core 0000:01:00.1: print_health_info:505:(pid 0): assert_callra 0x00876e08
[ 10.344175] mlx5_core 0000:01:00.1: print_health_info:506:(pid 0): fw_ver 14.32.1900
[ 10.344181] mlx5_core 0000:01:00.1: print_health_info:508:(pid 0): time 1745607553
[ 10.344187] mlx5_core 0000:01:00.1: print_health_info:509:(pid 0): hw_id 0x0000020b
[ 10.344191] mlx5_core 0000:01:00.1: print_health_info:510:(pid 0): rfr 0
[ 10.344195] mlx5_core 0000:01:00.1: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 10.344201] mlx5_core 0000:01:00.1: print_health_info:512:(pid 0): irisc_index 2
[ 10.344210] mlx5_core 0000:01:00.1: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 10.344216] mlx5_core 0000:01:00.1: print_health_info:515:(pid 0): ext_synd 0x8a47
[ 10.344222] mlx5_core 0000:01:00.1: print_health_info:516:(pid 0): raw fw_ver 0xe020076c
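To check whether both ports report the same failure, the health dump can be reduced to the message text per PCI function and compared. This is a minimal sketch, assuming bash and the exact log line layout shown above; `health_fields` is a helper name made up for this example, not part of any Mellanox tool.

```shell
#!/usr/bin/env bash
# health_fields is a hypothetical helper: it reads dmesg text on stdin and
# prints the message part (everything after "(pid N): ") of the
# print_health_info lines that belong to one PCI function.
health_fields() {
    local fn="$1"   # PCI function, e.g. 0000:01:00.0
    grep -F "mlx5_core ${fn}: print_health_info:" | sed 's/.*(pid [0-9]*): //'
}

# Usage (needs permission to read the kernel log):
# diff <(dmesg | health_fields 0000:01:00.0) <(dmesg | health_fields 0000:01:00.1)
```

An empty diff means both functions logged identical health fields. In the dump above, the two ports differ only in their kernel timestamps and PCI address.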
# flint -d 01:00.0 -ocr hw query
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 523
HwRevId 0x0
Flash Info:
Type W25QxxBV
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 1
Flash0.WriteProtected Disabled
JEDEC_ID 0x1840ef
TBS, BP[3:0] 0, 0000
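As a sanity check, the raw values in the health dump appear consistent with what the tools report. The sketch below assumes the raw fw_ver word packs the major version in the top 4 bits, the minor in the next 12 bits, and the sub-minor in the low 16 bits. That layout is inferred only from matching it against the fw_ver line in this dump, not taken from documentation, and `decode_fw_ver` is a made-up helper name.

```shell
#!/usr/bin/env bash
# decode_fw_ver is a hypothetical helper. ASSUMPTION: the bit layout
# (4-bit major, 12-bit minor, 16-bit sub-minor) is inferred from this dump.
decode_fw_ver() {
    local raw=$(( $1 ))
    printf '%d.%d.%d\n' \
        $(( (raw >> 28) & 0xf )) \
        $(( (raw >> 16) & 0xfff )) \
        $(( raw & 0xffff ))
}

decode_fw_ver 0xe020076c   # prints 14.32.1900, matching "fw_ver 14.32.1900"
printf '%d\n' 0x0000020b   # prints 523, matching "HwDevId 523" from flint
```

So the health buffer is at least being read from the expected device and firmware image, and the fields do not look like random garbage.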