Random crash with ConnectX-5 and ConnectX-6 Dx

Hi,

I have multiple Supermicro AMD EPYC servers, and at random times I receive the following errors. After these errors the server stays up but has no network connectivity, and I need to reboot it to fix the issue. I have the latest firmware installed on all components (motherboard, network adapters).
Debian 12.4 with kernel 6.5.

2024-01-13T06:46:41.798966+00:00 server1 kernel: [100789.636194] mlx5_core 0000:41:00.1: poll_health:825:(pid 0): Fatal error 1 detected
2024-01-13T06:46:41.798987+00:00 server1 kernel: [100789.636652] mlx5_core 0000:41:00.1: print_health_info:429:(pid 0): PCI slot is unavailable
2024-01-13T06:46:44.014966+00:00 server1 kernel: [100791.849430] mlx5_core 0000:41:00.1: mlx5_health_try_recover:341:(pid 2207027): handling bad device here
2024-01-13T06:46:44.015026+00:00 server1 kernel: [100791.850335] mlx5_core 0000:41:00.1: mlx5_error_sw_reset:245:(pid 2207027): start
2024-01-13T06:46:44.038949+00:00 server1 kernel: [100791.873293] mlx5_core 0000:41:00.0: poll_health:825:(pid 0): Fatal error 1 detected
2024-01-13T06:46:44.038954+00:00 server1 kernel: [100791.874315] mlx5_core 0000:41:00.0: print_health_info:429:(pid 0): PCI slot is unavailable
2024-01-13T06:46:44.566957+00:00 server1 kernel: [100792.402154] mlx5_core 0000:41:00.0 enp65s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:44.567002+00:00 server1 kernel: [100792.402825] mlx5_core 0000:41:00.0 enp65s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:44.567012+00:00 server1 kernel: [100792.403235] mlx5_core 0000:41:00.1 enp65s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:44.567013+00:00 server1 kernel: [100792.403578] mlx5_core 0000:41:00.1 enp65s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:46.018960+00:00 server1 kernel: [100793.853264] mlx5_core 0000:41:00.1: NIC IFC still 7 after 2000ms.
2024-01-13T06:46:46.018976+00:00 server1 kernel: [100793.853741] mlx5_core 0000:41:00.1: mlx5_error_sw_reset:278:(pid 2207027): end
2024-01-13T06:46:47.046958+00:00 server1 kernel: [100794.882093] mlx5_core 0000:41:00.0: mlx5_health_try_recover:341:(pid 2215766): handling bad device here
2024-01-13T06:46:47.046975+00:00 server1 kernel: [100794.883164] mlx5_core 0000:41:00.0: mlx5_error_sw_reset:245:(pid 2215766): start
2024-01-13T06:46:47.650985+00:00 server1 kernel: [100795.484683] bond0: (slave enp65s0f1np1): Releasing backup interface
2024-01-13T06:46:47.895282+00:00 server1 kernel: [100795.729366] mlx5_core 0000:41:00.1 enp65s0f1np1 (unregistering): left promiscuous mode
2024-01-13T06:46:47.895609+00:00 server1 kernel: [100795.730945] mlx5_core 0000:41:00.1 enp65s0f1np1 (unregistering): left allmulticast mode
2024-01-13T06:46:47.895645+00:00 server1 kernel: [100795.732272] mlx5_core 0000:41:00.1: mlx5e_execute_l2_action:603:(pid 2217743): MPFS, failed to add mac 1c:34:da:68:d5:9b, err(-67)
2024-01-13T06:46:47.898960+00:00 server1 kernel: [100795.735911] mlx5_core 0000:41:00.0 enp65s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:47.902960+00:00 server1 kernel: [100795.737235] mlx5_core 0000:41:00.0 enp65s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
2024-01-13T06:46:48.151446+00:00 server1 kernel: [100795.985312] mlx5_ib.rdma: probe of mlx5_core.rdma.0 failed with error -12
2024-01-13T06:46:48.171243+00:00 server1 kernel: [100796.005175] mlx5_core 0000:41:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
2024-01-13T06:46:48.171573+00:00 server1 kernel: [100796.005619] mlx5_core 0000:41:00.1: mlx5_wait_for_pages:786:(pid 2207027): Skipping wait for vf pages stage
2024-01-13T06:46:48.171630+00:00 server1 kernel: [100796.006020] mlx5_core 0000:41:00.1: mlx5_wait_for_pages:786:(pid 2207027): Skipping wait for vf pages stage
2024-01-13T06:46:49.078973+00:00 server1 kernel: [100796.913149] mlx5_core 0000:41:00.0: NIC IFC still 7 after 2000ms.
2024-01-13T06:46:49.078999+00:00 server1 kernel: [100796.914302] mlx5_core 0000:41:00.0: mlx5_error_sw_reset:278:(pid 2215766): end
2024-01-13T06:46:49.678971+00:00 server1 kernel: [100797.513874] bond0: (slave enp65s0f0np0): Removing an active aggregator
2024-01-13T06:46:49.678994+00:00 server1 kernel: [100797.514721] bond0: (slave enp65s0f0np0): Releasing backup interface
2024-01-13T06:46:49.946975+00:00 server1 kernel: [100797.782204] mlx5_core 0000:41:00.0 enp65s0f0np0 (unregistering): left promiscuous mode
2024-01-13T06:46:49.946987+00:00 server1 kernel: [100797.782762] mlx5_core 0000:41:00.0 enp65s0f0np0 (unregistering): left allmulticast mode
2024-01-13T06:46:49.950971+00:00 server1 kernel: [100797.785548] vmbr1: port 1(bond0) entered disabled state
2024-01-13T06:46:49.954963+00:00 server1 kernel: [100797.789361] vmbr0: port 1(bond0.102) entered disabled state
2024-01-13T06:46:50.202942+00:00 server1 kernel: [100798.037115] mlx5_core 0000:41:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
2024-01-13T06:46:50.202950+00:00 server1 kernel: [100798.039052] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:786:(pid 2215766): Skipping wait for vf pages stage
2024-01-13T06:46:50.202952+00:00 server1 kernel: [100798.039964] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:786:(pid 2215766): Skipping wait for vf pages stage
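
When collecting details for a support case, it can help to reduce a long log like the one above to just the fatal health events per PCI function. The following is a minimal sketch, not an official tool: the regular expression is an assumption based only on the line format shown in this post, and the names `LINE_RE`, `fatal_events`, and `SAMPLE` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log in this post:
#   "... kernel: [100789.636194] mlx5_core 0000:41:00.1: <message>"
# An interface name (e.g. enp65s0f0np0) may follow the PCI address.
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"[^:]*:\s+(?P<msg>.*)$"
)


def fatal_events(lines):
    """Return {pci_address: [uptime_seconds, ...]} for 'Fatal error' lines."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m and "Fatal error" in m.group("msg"):
            events[m.group("pci")].append(float(m.group("uptime")))
    return dict(events)


# Three lines copied from the log above.
SAMPLE = [
    "2024-01-13T06:46:41.798966+00:00 server1 kernel: [100789.636194] "
    "mlx5_core 0000:41:00.1: poll_health:825:(pid 0): Fatal error 1 detected",
    "2024-01-13T06:46:44.038949+00:00 server1 kernel: [100791.873293] "
    "mlx5_core 0000:41:00.0: poll_health:825:(pid 0): Fatal error 1 detected",
    "2024-01-13T06:46:46.018960+00:00 server1 kernel: [100793.853264] "
    "mlx5_core 0000:41:00.1: NIC IFC still 7 after 2000ms.",
]

for pci, times in fatal_events(SAMPLE).items():
    print(pci, times)
```

On the excerpt above this would list one fatal event for each of the two functions (0000:41:00.1 and 0000:41:00.0), roughly 2.2 seconds apart by the kernel uptime stamps. To run it on a full log, pass the lines of a saved kernel log file to `fatal_events` instead of `SAMPLE`.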

Greetings,

Thank you for reaching out to us!

It appears that the issue you’re encountering may require a more in-depth analysis than what our community pages can provide. To assist you effectively, we would need additional details about the systems in question.

We kindly ask that you initiate a support case with our team by visiting the Enterprise Support Portal. Our dedicated experts are ready to dive into the problem and work with you towards a timely resolution.

We appreciate your cooperation and look forward to assisting you. Wishing you a wonderful day ahead!

Warm regards,

Ilan