Hello everyone,
Please help me investigate the following issue.
I installed ConnectX-6 NICs in my server, and both ports frequently go link down and then link up again. I am not sure what the root cause is.
Below is an excerpt of the dmesg log from the time the issue occurred; the full log is attached.
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: poll_health:1099:(pid 0): device’s health compromised - reached miss count
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x212046fc
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:505:(pid 0): assert_callra 0x2120b3f0
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:506:(pid 0): fw_ver 20.43.2566
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:508:(pid 0): time 0
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:509:(pid 0): hw_id 0x0000020f
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:510:(pid 0): rfr 0
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:512:(pid 0): irisc_index 6
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:515:(pid 0): ext_synd 0x8a02
[Tue Mar 24 14:32:27 2026] mlx5_core 0000:a9:00.0: print_health_info:516:(pid 0): raw fw_ver 0x142b0a06
…..
[Tue Mar 24 15:47:07 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:47:07 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down
[Tue Mar 24 15:47:13 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up
[Tue Mar 24 15:47:14 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up
[Tue Mar 24 15:47:26 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:47:26 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down
[Tue Mar 24 15:47:32 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up
[Tue Mar 24 15:47:33 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:47:35 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up
[Tue Mar 24 15:47:44 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:47:45 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down
[Tue Mar 24 15:47:47 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up
[Tue Mar 24 15:47:52 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up
[Tue Mar 24 15:48:01 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:48:02 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down
[Tue Mar 24 15:48:09 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up
[Tue Mar 24 15:48:09 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up
[Tue Mar 24 15:48:18 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
[Tue Mar 24 15:48:20 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down
[Tue Mar 24 15:48:25 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up
[Tue Mar 24 15:48:26 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up
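
In case it helps with triage, here is a minimal sketch for counting the link transitions per interface in a log like the one above. It assumes the line format shown in the excerpt (`mlx5_core <PCI address> <interface>: Link up|down`); the function name `count_link_events` and the regular expression are my own and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
# [Tue Mar 24 15:47:07 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down
LINK_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+mlx5_core\s+(?P<pci>\S+)\s+"
    r"(?P<ifname>\S+):\s+Link (?P<state>up|down)\s*$"
)


def count_link_events(lines):
    """Count 'Link up' and 'Link down' messages per interface.

    Returns a Counter keyed by (interface name, state).
    Lines that do not match the assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match:
            counts[(match.group("ifname"), match.group("state"))] += 1
    return counts


if __name__ == "__main__":
    # Four lines copied from the excerpt above, used as sample input.
    sample = [
        "[Tue Mar 24 15:47:07 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link down",
        "[Tue Mar 24 15:47:07 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link down",
        "[Tue Mar 24 15:47:13 2026] mlx5_core 0000:a9:00.0 ens4f0np0: Link up",
        "[Tue Mar 24 15:47:14 2026] mlx5_core 0000:a9:00.1 ens4f1np1: Link up",
    ]
    for (ifname, state), n in sorted(count_link_events(sample).items()):
        print(f"{ifname} link {state}: {n}")
```

To run it against the full attached log, pass the open file object instead of the sample list, since the function only iterates over lines.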
oss36_dmesg.log (835.0 KB)