InfiniBand error: device's health compromised

I see the following errors on one of my nodes running Debian 11:

[   12.454838] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[0] 0x00000008
[   12.463238] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[1] 0x0006523c
[   12.471636] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[2] 0x00000000
[   12.480031] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[3] 0x00000000
[   12.488422] mlx5_core 0000:af:00.0: print_health_info:390:(pid 0): assert_var[4] 0x00000000
[   12.496827] mlx5_core 0000:af:00.0: print_health_info:393:(pid 0): assert_exit_ptr 0x2081d27c
[   12.505400] mlx5_core 0000:af:00.0: print_health_info:395:(pid 0): assert_callra 0x2081d6dc
[   12.513804] mlx5_core 0000:af:00.0: print_health_info:398:(pid 0): fw_ver 20.39.1002
[   12.521589] mlx5_core 0000:af:00.0: print_health_info:399:(pid 0): hw_id 0x0000020f
[   12.529293] mlx5_core 0000:af:00.0: print_health_info:400:(pid 0): irisc_index 10
[   12.536849] mlx5_core 0000:af:00.0: print_health_info:401:(pid 0): synd 0x8: unrecoverable hardware error
[   12.546487] mlx5_core 0000:af:00.0: print_health_info:403:(pid 0): ext_synd 0x0111
[   12.554125] mlx5_core 0000:af:00.0: print_health_info:405:(pid 0): raw fw_ver 0x142703ea

I have seen similar errors reported here before, and the suggested solution was to update the firmware. I have already done that but still see the error. Is there any other way to check whether this is a hardware error or a software error? Thank you.
Best Regards

Hello,

“assert” messages mean that an abnormal situation has occurred in the firmware.
We are not aware of any known issue associated with this assert pointer (0x2081d27c).
To debug this further, please open a case by emailing enterprisesupport@nvidia.com; it will be handled according to entitlement.

Best Regards,
Jonathan.

Hi,
I will do so, but to understand this better myself: is it always an issue with the firmware? We run the same firmware, cards, and workload on other nodes, and there I see neither this error nor any crashes.
Also, I did not paste the error completely. Before the log I posted, there is also this line:

[ 12.952777] mlx5_core 0000:3b:00.0: poll_health:735:(pid 0): device’s health compromised - reached miss count

Hello,

Yes - asserts should not occur.
It means something unexpected occurred - either in the Firmware or the Hardware itself.
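
When comparing an affected node with healthy ones, it can help to pull the health fields out of the kernel log per PCI device. The following is only a rough sketch in Python, assuming the line format seen in the log excerpts in this thread. The names `parse_health` and `decode_raw_fw_ver` are made up for illustration, and the 8/8/16-bit split of the raw firmware version is inferred from the sample (0x142703ea lines up with 20.39.1002), not taken from driver documentation.

```python
import re
from collections import defaultdict

# Matches mlx5_core health lines in the format seen in this thread, e.g.
# "[   12.513804] mlx5_core 0000:af:00.0: print_health_info:398:(pid 0): fw_ver 20.39.1002"
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<dev>[0-9a-f:.]+):\s+"
    r"(?P<func>\w+):\d+:\(pid \d+\):\s+(?P<msg>.*)$"
)


def decode_raw_fw_ver(raw):
    """Split a 32-bit raw fw_ver into major.minor.subminor.

    Assumption: 8 bits major, 8 bits minor, 16 bits subminor,
    inferred from the sample log rather than from documentation.
    """
    return f"{raw >> 24}.{(raw >> 16) & 0xFF}.{raw & 0xFFFF}"


def parse_health(lines):
    """Group health messages by PCI address: {dev: [(timestamp, func, msg), ...]}."""
    by_dev = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            by_dev[m["dev"]].append((float(m["ts"]), m["func"], m["msg"]))
    return dict(by_dev)


# A few lines copied from the logs posted in this thread.
sample = [
    "[   12.513804] mlx5_core 0000:af:00.0: print_health_info:398:(pid 0): fw_ver 20.39.1002",
    "[   12.536849] mlx5_core 0000:af:00.0: print_health_info:401:(pid 0): synd 0x8: unrecoverable hardware error",
    "[   12.554125] mlx5_core 0000:af:00.0: print_health_info:405:(pid 0): raw fw_ver 0x142703ea",
    "[ 12.952777] mlx5_core 0000:3b:00.0: poll_health:735:(pid 0): device's health compromised - reached miss count",
]

if __name__ == "__main__":
    for dev, entries in parse_health(sample).items():
        print(dev)
        for ts, func, msg in entries:
            print(f"  {ts:10.6f} {func}: {msg}")
    print(decode_raw_fw_ver(0x142703EA))  # prints 20.39.1002
```

Running this over the full `dmesg` output of each node and comparing the per-device results shows which PCI address each message belongs to; in the excerpts posted here, the assert dump is from 0000:af:00.0 while the "health compromised" line is from 0000:3b:00.0.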

Best Regards,
Jonathan.