MT27710 Family [ConnectX-4 Lx] NIC failure, firmware 14.31.1014

I have seen a few cases of this on different servers and would like advice on whether it is SW or HW related.
The log is below:
mlx5_core 0000:19:00.0: poll_health:839:(pid 0): Fatal error 1 detected
mlx5_core 0000:19:00.0: print_health_info:458:(pid 0): assert_var[0] 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:458:(pid 0): assert_var[1] 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:458:(pid 0): assert_var[2] 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:458:(pid 0): assert_var[3] 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:458:(pid 0): assert_var[4] 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:461:(pid 0): assert_exit_ptr 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:463:(pid 0): assert_callra 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:465:(pid 0): fw_ver 65535.65535.65535
mlx5_core 0000:19:00.0: print_health_info:466:(pid 0): hw_id 0xffffffff
mlx5_core 0000:19:00.0: print_health_info:467:(pid 0): irisc_index 255
mlx5_core 0000:19:00.0: print_health_info:469:(pid 0): synd 0xff: unrecognized error
mlx5_core 0000:19:00.0: print_health_info:470:(pid 0): ext_synd 0xffff
mlx5_core 0000:19:00.0: print_health_info:472:(pid 0): raw fw_ver 0xffffffff
mlx5_core 0000:19:00.0: mlx5_health_try_recover:381:(pid 136014): handling bad device here
mlx5_core 0000:19:00.0: mlx5_error_sw_reset:243:(pid 136014): start
mlx5_core 0000:19:00.1: poll_health:839:(pid 0): Fatal error 1 detected
mlx5_core 0000:19:00.1: print_health_info:458:(pid 0): assert_var[0] 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:458:(pid 0): assert_var[1] 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:458:(pid 0): assert_var[2] 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:458:(pid 0): assert_var[3] 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:458:(pid 0): assert_var[4] 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:461:(pid 0): assert_exit_ptr 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:463:(pid 0): assert_callra 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:465:(pid 0): fw_ver 65535.65535.65535
mlx5_core 0000:19:00.1: print_health_info:466:(pid 0): hw_id 0xffffffff
mlx5_core 0000:19:00.1: print_health_info:467:(pid 0): irisc_index 255
mlx5_core 0000:19:00.1: print_health_info:469:(pid 0): synd 0xff: unrecognized error
mlx5_core 0000:19:00.1: print_health_info:470:(pid 0): ext_synd 0xffff
mlx5_core 0000:19:00.1: print_health_info:472:(pid 0): raw fw_ver 0xffffffff
mlx5_core 0000:19:00.1: mlx5_health_try_recover:381:(pid 214631): handling bad device here
mlx5_core 0000:19:00.1: mlx5_error_sw_reset:243:(pid 214631): start

Hi,
It is a server HW issue. The PCIe link between the ConnectX card and the server went down, so every field in the error report reads as all 1’s.
I usually see this when the NIC is placed in a non-standard server design, and in those cases we work with the server manufacturer to resolve the issue.
Regards,
Yaniv
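
For anyone triaging similar logs: a minimal Python sketch, not an official tool, that flags devices whose health dump reads back as all 1's, the pattern described above. The function name all_ones_devices and the regular expression are my own assumptions, written against the log format pasted in this thread; other driver versions may print the fields differently.

```python
import re

# Matches the 32-bit hex fields printed by print_health_info in the log above.
# The field list is an assumption based on that log, not on the driver source.
FIELD_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f:.]+): print_health_info:\d+:\(pid \d+\): "
    r"(?P<name>assert_var\[\d\]|assert_exit_ptr|assert_callra|hw_id|raw fw_ver) "
    r"(?P<value>0x[0-9a-f]+)"
)


def all_ones_devices(log_text):
    """Return PCI addresses whose parsed health fields all read 0xffffffff."""
    seen = {}
    for m in FIELD_RE.finditer(log_text):
        is_all_ones = int(m["value"], 16) == 0xFFFFFFFF
        seen.setdefault(m["bdf"], []).append(is_all_ones)
    # A device is flagged only if every parsed field was all 1's.
    return sorted(bdf for bdf, flags in seen.items() if all(flags))


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 check_health.py
    for bdf in all_ones_devices(sys.stdin.read()):
        print(f"{bdf}: health buffer reads all 1's, PCIe link is suspect")
```

If a device is flagged, the dump carries no real firmware state, so the values (fw_ver 65535.65535.65535, synd 0xff) should not be interpreted as an actual firmware assert.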

Hey Yaniv, thank you for this information.
Yes, I am trying to get the manufacturer to run diagnostics on this.
Strangely, it happened with 3 NICs installed in different servers. One of them recovered after rebooting the server but later failed again.