Assistance with Firmware Issue on MCX653106A-ECAT Card after OFED Update on Rocky Linux 9.5

I’m encountering an issue with my MCX653106A-ECAT card after updating to the latest available OFED for Rocky Linux 9.5. Since the update, the following error has been appearing in the dmesg logs:
[ 3091.007674] mlx5_core 0000:71:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR.

To resolve this, I have attempted the following steps:

  1. Ran flint -d commands (P1 and P2) to change the mode of the card.
  2. Attempted to re-flash the firmware.

However, none of these actions have resolved the issue, and the error persists.

I’ve attached the relevant dmesg output for further context. Any assistance or guidance on how to resolve this would be greatly appreciated.
[ 3091.007615] mlx5_core 0000:71:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[ 3091.007674] mlx5_core 0000:71:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 3091.007701] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 3091.007717] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[ 3091.007733] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 3091.007748] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 3091.007764] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 3091.007779] mlx5_core 0000:71:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 3091.007794] mlx5_core 0000:71:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x209f2258
[ 3091.007811] mlx5_core 0000:71:00.0: print_health_info:505:(pid 0): assert_callra 0x209f9118
[ 3091.007831] mlx5_core 0000:71:00.0: print_health_info:506:(pid 0): fw_ver 20.39.4082
[ 3091.007847] mlx5_core 0000:71:00.0: print_health_info:508:(pid 0): time 0
[ 3091.007861] mlx5_core 0000:71:00.0: print_health_info:509:(pid 0): hw_id 0x0000020f
[ 3091.007874] mlx5_core 0000:71:00.0: print_health_info:510:(pid 0): rfr 0
[ 3091.007885] mlx5_core 0000:71:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 3091.007901] mlx5_core 0000:71:00.0: print_health_info:512:(pid 0): irisc_index 6
[ 3091.007918] mlx5_core 0000:71:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 3091.007935] mlx5_core 0000:71:00.0: print_health_info:515:(pid 0): ext_synd 0x8a02
[ 3091.007950] mlx5_core 0000:71:00.0: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
[ 3091.839630] mlx5_core 0000:71:00.1: poll_health:1082:(pid 0): device's health compromised - reached miss count
[ 3091.839667] mlx5_core 0000:71:00.1: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 3091.839689] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 3091.839706] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[ 3091.839721] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 3091.839737] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 3091.839752] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 3091.839767] mlx5_core 0000:71:00.1: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 3091.839782] mlx5_core 0000:71:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x209f2258
[ 3091.839797] mlx5_core 0000:71:00.1: print_health_info:505:(pid 0): assert_callra 0x209f9118
[ 3091.839817] mlx5_core 0000:71:00.1: print_health_info:506:(pid 0): fw_ver 20.39.4082
[ 3091.839832] mlx5_core 0000:71:00.1: print_health_info:508:(pid 0): time 0
[ 3091.839846] mlx5_core 0000:71:00.1: print_health_info:509:(pid 0): hw_id 0x0000020f
[ 3091.839859] mlx5_core 0000:71:00.1: print_health_info:510:(pid 0): rfr 0
[ 3091.839869] mlx5_core 0000:71:00.1: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 3091.839885] mlx5_core 0000:71:00.1: print_health_info:512:(pid 0): irisc_index 6
[ 3091.839901] mlx5_core 0000:71:00.1: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 3091.839918] mlx5_core 0000:71:00.1: print_health_info:515:(pid 0): ext_synd 0x8a02
[ 3091.839933] mlx5_core 0000:71:00.1: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
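
For anyone comparing firmware versions across ports in output like this: the "raw fw_ver" field looks like a packed form of the "fw_ver" string. In this log, 0x14270ff2 splits into 0x14 = 20, 0x27 = 39 and 0x0ff2 = 4082, which matches fw_ver 20.39.4082. The Python sketch below assumes that layout (8-bit major, 8-bit minor, 16-bit sub-minor), which is inferred only from the values in this log and not from vendor documentation. The function names are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Matches lines such as:
# [ 3091.007950] mlx5_core 0000:71:00.0: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
RAW_FW_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f:.]+): print_health_info:\d+:\(pid \d+\): "
    r"raw fw_ver (?P<raw>0x[0-9a-fA-F]{8})"
)


def decode_raw_fw_ver(raw):
    """Split a 32-bit raw fw_ver into (major, minor, subminor).

    Assumed layout (inferred from this log only):
    bits 31-24 major, bits 23-16 minor, bits 15-0 sub-minor.
    """
    return (raw >> 24) & 0xFF, (raw >> 16) & 0xFF, raw & 0xFFFF


def raw_fw_versions(dmesg_text):
    """Return {pci_address: "major.minor.subminor"} for each raw fw_ver line."""
    versions = {}
    for match in RAW_FW_RE.finditer(dmesg_text):
        major, minor, sub = decode_raw_fw_ver(int(match.group("raw"), 16))
        versions[match.group("pci")] = f"{major}.{minor}.{sub}"
    return versions


# The two raw fw_ver lines from the log above, one per port.
sample = """\
[ 3091.007950] mlx5_core 0000:71:00.0: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
[ 3091.839933] mlx5_core 0000:71:00.1: print_health_info:516:(pid 0): raw fw_ver 0x14270ff2
"""

print(raw_fw_versions(sample))
# {'0000:71:00.0': '20.39.4082', '0000:71:00.1': '20.39.4082'}
```

Both ports report the same firmware version. To run it against a live system, pass the saved output of dmesg to raw_fw_versions() in place of the sample string.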

Thank you in advance!

Best regards,
Ben

Please check my reply on this post:

Regards,
Yaniv

Hey Yaniv.

I’ve already tried a downgrade and more.

Thank you for the update. I’ve consulted the server vendor, and a BIOS update fixed everything.

The ticket can be closed.

Best Regards,
Ben.