Hello, guys
Please help me investigate the following issue.
I have two Dell servers with two MCX4121A-ACA_Ax cards installed. The OS is Ubuntu 22.04 LTS.
I recently upgraded the firmware to the latest version, 14.32.1010, and now I see the following error in dmesg on both servers, for both cards:
[ 14.944178] mlx5_core 0000:3b:00.0: poll_health:971:(pid 0): device's health compromised - reached miss count
[ 14.946551] mlx5_core 0000:3b:00.0: print_health_info:491:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 14.951182] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[0] 0x00000000
[ 14.953561] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[1] 0x000000b9
[ 14.955588] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[2] 0x00000040
[ 14.957305] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[3] 0x00000000
[ 14.958959] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[4] 0x00000000
[ 14.960531] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[5] 0x00000000
[ 14.962061] mlx5_core 0000:3b:00.0: print_health_info:498:(pid 0): assert_exit_ptr 0x008771e4
[ 14.963559] mlx5_core 0000:3b:00.0: print_health_info:499:(pid 0): assert_callra 0x00810ba4
[ 14.964931] mlx5_core 0000:3b:00.0: print_health_info:500:(pid 0): fw_ver 14.32.1010
[ 14.965864] mlx5_core 0000:3b:00.0: print_health_info:502:(pid 0): time 0
[ 14.966787] mlx5_core 0000:3b:00.0: print_health_info:503:(pid 0): hw_id 0x0000020b
[ 14.967690] mlx5_core 0000:3b:00.0: print_health_info:504:(pid 0): rfr 0
[ 14.968584] mlx5_core 0000:3b:00.0: print_health_info:505:(pid 0): severity 3 (ERROR)
[ 14.969454] mlx5_core 0000:3b:00.0: print_health_info:506:(pid 0): irisc_index 2
[ 14.970310] mlx5_core 0000:3b:00.0: print_health_info:507:(pid 0): synd 0x1: firmware internal error
[ 14.971166] mlx5_core 0000:3b:00.0: print_health_info:509:(pid 0): ext_synd 0x805b
[ 14.972008] mlx5_core 0000:3b:00.0: print_health_info:510:(pid 0): raw fw_ver 0xe02003f2
Here is the device hw query output:
flint -d 3b:00.1 -ocr hw query
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 523
HwRevId 0x0
Flash Info:
Type W25QxxBV
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 1
Flash0.WriteProtected Top,8-SubSectors
JEDEC_ID 0x1840ef
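As a sanity check that the driver and flint are reporting the same device and firmware, the raw values from the two outputs above can be compared. This is only a sketch: the bit layout of raw fw_ver (4-bit major, 12-bit minor, 16-bit sub-minor) is my assumption, inferred from this one log and not taken from documentation, and the function name decode_fw_ver is made up for illustration.

```python
# Values copied from the dmesg and flint output above.
RAW_FW_VER = 0xE02003F2   # "raw fw_ver" line in dmesg
HW_ID = 0x0000020B        # "hw_id" line in dmesg
FLINT_HW_DEV_ID = 523     # "HwDevId" line in flint hw query


def decode_fw_ver(raw: int) -> str:
    """Decode a raw 32-bit fw_ver value into a dotted version string.

    ASSUMPTION: major is in bits 31-28, minor in bits 27-16 and
    sub-minor in bits 15-0. This layout is inferred from the log
    above, not from any specification.
    """
    major = (raw >> 28) & 0xF
    minor = (raw >> 16) & 0xFFF
    sub_minor = raw & 0xFFFF
    return f"{major}.{minor}.{sub_minor}"


if __name__ == "__main__":
    print("decoded fw_ver:", decode_fw_ver(RAW_FW_VER))
    print("hw_id matches HwDevId:", HW_ID == FLINT_HW_DEV_ID)
```

Under that assumption the raw value decodes to 14.32.1010, the same as the fw_ver line, and hw_id 0x20b equals HwDevId 523, so both tools appear to describe the same device running the new firmware.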
I now need to put these servers into production, but I am concerned that this error could cause network interruptions.
Please help and advise.