mlx5_core 0000:41:00.0: poll_health:853:(pid 0): device's health compromised

Hi,
We see the error below, followed by a watchdog soft lockup, while running our tests:

[ 11.376220] mlx5_core 0000:41:00.0: poll_health:853:(pid 0): device's health compromised - reached miss count
[ 11.376286] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): assert_var[0] 0x00004200
[ 11.376321] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): assert_var[1] 0x0010ca5c
[ 11.376359] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): assert_var[2] 0x00000000
[ 11.376392] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): assert_var[3] 0x00000000
[ 11.376424] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): assert_var[4] 0x00000000
[ 11.376457] mlx5_core 0000:41:00.0: print_health_info:459:(pid 0): assert_exit_ptr 0x00806990
[ 11.376491] mlx5_core 0000:41:00.0: print_health_info:461:(pid 0): assert_callra 0x00806c6c
[ 11.376532] mlx5_core 0000:41:00.0: print_health_info:464:(pid 0): fw_ver 16.31.2006
[ 11.376561] mlx5_core 0000:41:00.0: print_health_info:465:(pid 0): hw_id 0x0000020d
[ 11.376594] mlx5_core 0000:41:00.0: print_health_info:466:(pid 0): irisc_index 10
[ 11.376628] mlx5_core 0000:41:00.0: print_health_info:467:(pid 0): synd 0x8: unrecoverable hardware error
[ 11.376665] mlx5_core 0000:41:00.0: print_health_info:469:(pid 0): ext_synd 0x0001
[ 11.376697] mlx5_core 0000:41:00.0: print_health_info:471:(pid 0): raw fw_ver 0x101f07d6
dmesg_connectx5.log (250.2 KB)
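
For anyone triaging a similar dump, the `print_health_info` lines above can be collected into key/value pairs for easier comparison across hosts. This is only a rough sketch: the regular expression is an assumption based on the line format seen in this log, and `parse_health_info` is a hypothetical helper name, not part of any NVIDIA or kernel tool.

```python
import re

# Assumed line shape, taken from the log above:
# "... print_health_info:<src line>:(pid N): <key> <value...>"
HEALTH_LINE = re.compile(r"print_health_info:\d+:\(pid \d+\): (.+)$")


def parse_health_info(log_text):
    """Return a dict of health fields found in mlx5_core print_health_info lines."""
    fields = {}
    for line in log_text.splitlines():
        match = HEALTH_LINE.search(line)
        if not match:
            continue
        payload = match.group(1).strip()
        if payload.startswith("raw fw_ver"):
            # The only key in this dump that contains a space.
            key = "raw fw_ver"
            value = payload[len(key):].strip()
        else:
            key, _, value = payload.partition(" ")
        fields[key] = value
    return fields


sample = """\
[ 11.376532] mlx5_core 0000:41:00.0: print_health_info:464:(pid 0): fw_ver 16.31.2006
[ 11.376628] mlx5_core 0000:41:00.0: print_health_info:467:(pid 0): synd 0x8: unrecoverable hardware error
[ 11.376697] mlx5_core 0000:41:00.0: print_health_info:471:(pid 0): raw fw_ver 0x101f07d6
"""

info = parse_health_info(sample)
print(info["fw_ver"])
print(info["synd"])
```

The same function can be pointed at the full contents of a saved dmesg file; lines that do not match the assumed pattern are simply skipped.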

Any thoughts?

JC

The firmware is stuck.

The firmware version you are using, 16.31.2006 according to the log, is very old.
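
As a cross-check on the installed version, the `raw fw_ver 0x101f07d6` line in the log can be decoded by hand. The sketch below assumes the raw word packs major, minor, and sub-minor as 8, 8, and 16 bits respectively; that layout is inferred from this log (it reproduces the `fw_ver 16.31.2006` line), not taken from driver documentation, and `decode_raw_fw_ver` is a hypothetical helper name.

```python
def decode_raw_fw_ver(raw):
    """Split a raw fw_ver word into 'major.minor.subminor'.

    Assumed layout (inferred from the log in this thread):
    bits 31-24 = major, bits 23-16 = minor, bits 15-0 = sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"


print(decode_raw_fw_ver(0x101F07D6))
```

For the value in this log the result matches the `fw_ver` line, so the card is running 16.31.2006.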

Please update to the latest 16.35.xx release:

https://network.nvidia.com/support/firmware/firmware-downloads/
