I see the following errors on one of my nodes running Debian 11:
[ 12.454838] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[0] 0x00000008
[ 12.463238] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[1] 0x0006523c
[ 12.471636] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[2] 0x00000000
[ 12.480031] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[3] 0x00000000
[ 12.488422] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[4] 0x00000000
[ 12.496827] mlx5_core 0000:af:00.0: print_health_info:393:(pid 0): assert_exit_ptr 0x2081d27c
[ 12.505400] mlx5_core 0000:af:00.0: print_health_info:395:(pid 0): assert_callra 0x2081d6dc
[ 12.513804] mlx5_core 0000:af:00.0: print_health_info:398:(pid 0): fw_ver 20.39.1002
[ 12.521589] mlx5_core 0000:af:00.0: print_health_info:399:(pid 0): hw_id 0x0000020f
[ 12.529293] mlx5_core 0000:af:00.0: print_health_info:400:(pid 0): irisc_index 10
[ 12.536849] mlx5_core 0000:af:00.0: print_health_info:401:(pid 0): synd 0x8: unrecoverable hardware error
[ 12.546487] mlx5_core 0000:af:00.0: print_health_info:403:(pid 0): ext_synd 0x0111
[ 12.554125] mlx5_core 0000:af:00.0: print_health_info:405:(pid 0): raw fw_ver 0x142703ea
Similar errors have been reported here before, and the suggested solution was to update the firmware. I have already done that but still see the error. Is there any other way to determine whether this is really a hardware fault or a software issue? Thank you.
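As a sanity check on which firmware the card is actually running, the `raw fw_ver` word from the log can be decoded by hand. The sketch below assumes the layout that the log itself suggests (major in the top byte, minor in the next byte, sub-minor in the low 16 bits). The function name `decode_raw_fw_ver` is only illustrative and is not part of any driver or tool.

```python
def decode_raw_fw_ver(raw: int) -> str:
    """Split a 32-bit raw fw_ver word into major.minor.sub.

    Assumed layout (inferred from the log above, not from a spec):
    bits 31..24 = major, bits 23..16 = minor, bits 15..0 = sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    sub = raw & 0xFFFF
    return f"{major}.{minor}.{sub}"


# Value taken from the "raw fw_ver 0x142703ea" line in the log.
print(decode_raw_fw_ver(0x142703EA))  # prints 20.39.1002
```

The decoded value matches the `fw_ver 20.39.1002` line, so the two fields in the log agree with each other. Comparing it against the version I flashed shows whether the update took effect.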
Best Regards