I’m encountering an issue with my MCX653106A-ECAT card after updating to the latest available OFED for Rocky Linux 9.5. After the update, I started receiving the following error in dmesg logs:
[ 3091.007674] mlx5_core 0000:71:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR
To resolve this, I have attempted the following steps:
- Ran flint -d commands (P1 and P2) to change the mode of the card.
- Attempted to re-flash the firmware.
However, none of these actions have resolved the issue, and the error persists.
I’ve attached the relevant dmesg output for further context. Any assistance or guidance on how to resolve this would be greatly appreciated.
[ 3091.007615] mlx5_core 0000:71:00.0: poll_health:1082:(pid 0): device’s health compromised - reached miss count
[ 3091.007674] mlx5_core 0000:71:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 3091.007701] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 3091.007717] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[ 3091.007733] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 3091.007748] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 3091.007764] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 3091.007779] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 3091.007794] mlx5_core 0000:71:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x209f2258
[ 3091.007811] mlx5_core 0000:71:00.0: print_health_info:505:(pid 0): assert_callra 0x209f9118
[ 3091.007831] mlx5_core 0000:71:00.0: print_health_info:506:(pid 0): fw_ver 20.39.4082
[ 3091.007847] mlx5_core 0000:71:00.0: print_health_info:508:(pid 0): time 0
[ 3091.007861] mlx5_core 0000:71:00.0: print_health_info:509:(pid 0): hw_id 0x0000020f
[ 3091.007874] mlx5_core 0000:71:00.0: print_health_info:510:(pid 0): rfr 0
[ 3091.007885] mlx5_core 0000:71:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 3091.007901] mlx5_core 0000:71:00.0: print_health_info:512:(pid 0): irisc_index 6
[ 3091.007918] mlx5_core 0000:71:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 3091.007935] mlx5_core 0000:71:00.0: print_health_info:515:(pid 0): ext_synd 0x8a02
[ 3091.007950] mlx5_core 0000:71:00.0: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
[ 3091.839630] mlx5_core 0000:71:00.1: poll_health:1082:(pid 0): device’s health compromised - reached miss count
[ 3091.839667] mlx5_core 0000:71:00.1: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 3091.839689] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 3091.839706] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[ 3091.839721] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 3091.839737] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 3091.839752] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 3091.839767] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 3091.839782] mlx5_core 0000:71:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x209f2258
[ 3091.839797] mlx5_core 0000:71:00.1: print_health_info:505:(pid 0): assert_callra 0x209f9118
[ 3091.839817] mlx5_core 0000:71:00.1: print_health_info:506:(pid 0): fw_ver 20.39.4082
[ 3091.839832] mlx5_core 0000:71:00.1: print_health_info:508:(pid 0): time 0
[ 3091.839846] mlx5_core 0000:71:00.1: print_health_info:509:(pid 0): hw_id 0x0000020f
[ 3091.839859] mlx5_core 0000:71:00.1: print_health_info:510:(pid 0): rfr 0
[ 3091.839869] mlx5_core 0000:71:00.1: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 3091.839885] mlx5_core 0000:71:00.1: print_health_info:512:(pid 0): irisc_index 6
[ 3091.839901] mlx5_core 0000:71:00.1: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 3091.839918] mlx5_core 0000:71:00.1: print_health_info:515:(pid 0): ext_synd 0x8a02
[ 3091.839933] mlx5_core 0000:71:00.1: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
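For anyone comparing the two dumps, the fields can be grouped per PCI function with a short script. This is only a sketch: the function names are illustrative, the regular expression is written against the line format shown above, and the 8/8/16-bit split of raw fw_ver is inferred from this log (0x14270ff2 lines up with 20.39.4082), not taken from driver documentation.

```python
import re

# Matches print_health_info lines in the format shown in the dmesg output above.
# "raw fw_ver" is listed first so it is not mis-split into key "raw".
LINE_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f:.]+): print_health_info:\d+:\(pid \d+\): "
    r"(?P<key>raw fw_ver|[a-z_]+(?:\[\d+\])?) (?P<value>.+)"
)


def parse_health_dump(text):
    """Group health-dump fields by PCI address (e.g. 0000:71:00.0)."""
    devices = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            devices.setdefault(m["pci"], {})[m["key"]] = m["value"].strip()
    return devices


def decode_raw_fw_ver(raw):
    """Split a 32-bit raw fw_ver into major.minor.subminor.

    Assumption: 8 bits major, 8 bits minor, 16 bits subminor,
    inferred from the log in this post.
    """
    value = int(raw, 16)
    return f"{value >> 24}.{(value >> 16) & 0xFF}.{value & 0xFFFF}"
```

Feeding it the output of dmesg (saved to a file or piped in) gives one dictionary per port, which makes it easy to confirm that both functions report the same ext_synd, irisc_index, and firmware version.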
Thank you in advance!
Best regards,
Ben