IB slot poll_health fatal error, health recovery failed

Hi, all,

One IB slot on an AMD EPYC 9554 node repeatedly fails the card installed in it.
There are two dual-port NDR200 IB cards on this node. All ports are connected.
The 87:00 slot fails repeatedly, even after I swapped the cards and rebooted.
The card works for a short time after a reboot, but fails again after a while.

I have disabled C-states in the BIOS and set intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1 processor.ignore_ppc=1 on the kernel command line.

What could be wrong? Thanks.

lspci | grep -i mellanox

67:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
67:00.1 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
87:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
87:00.1 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]

[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: poll_health:838:(pid 0): Fatal error 1 detected
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:457:(pid 0): assert_var[0] 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:457:(pid 0): assert_var[1] 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:457:(pid 0): assert_var[2] 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:457:(pid 0): assert_var[3] 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:457:(pid 0): assert_var[4] 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:460:(pid 0): assert_exit_ptr 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:462:(pid 0): assert_callra 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:464:(pid 0): fw_ver 65535.65535.65535
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:465:(pid 0): hw_id 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:466:(pid 0): irisc_index 255
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:468:(pid 0): synd 0xff: unrecognized error
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:469:(pid 0): ext_synd 0xffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: print_health_info:471:(pid 0): raw fw_ver 0xffffffff
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: mlx5_health_try_recover:379:(pid 95881): handling bad device here
[Thu Feb 8 15:16:36 2024] mlx5_core 0000:87:00.0: mlx5_error_sw_reset:239:(pid 95881): start
[Thu Feb 8 15:16:37 2024] mlx5_core 0000:87:00.0: NIC IFC still 7 after 1000ms.
[Thu Feb 8 15:16:37 2024] mlx5_core 0000:87:00.0: mlx5_error_sw_reset:272:(pid 95881): end
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: poll_health:838:(pid 0): Fatal error 1 detected
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:457:(pid 0): assert_var[0] 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:457:(pid 0): assert_var[1] 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:457:(pid 0): assert_var[2] 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:457:(pid 0): assert_var[3] 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:457:(pid 0): assert_var[4] 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:460:(pid 0): assert_exit_ptr 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:462:(pid 0): assert_callra 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:464:(pid 0): fw_ver 65535.65535.65535
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:465:(pid 0): hw_id 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:466:(pid 0): irisc_index 255
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:468:(pid 0): synd 0xff: unrecognized error
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:469:(pid 0): ext_synd 0xffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: print_health_info:471:(pid 0): raw fw_ver 0xffffffff
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: mlx5_health_try_recover:379:(pid 96872): handling bad device here
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.1: mlx5_error_sw_reset:239:(pid 96872): start
[Thu Feb 8 15:16:38 2024] mlx5_core 0000:87:00.0: mlx5_wait_for_pages:774:(pid 95881): Skipping wait for vf pages stage
[Thu Feb 8 15:16:39 2024] mlx5_core 0000:87:00.1: NIC IFC still 7 after 1000ms.
[Thu Feb 8 15:16:39 2024] mlx5_core 0000:87:00.1: mlx5_error_sw_reset:272:(pid 96872): end
[Thu Feb 8 15:16:40 2024] mlx5_core 0000:87:00.1: mlx5_wait_for_pages:774:(pid 96872): Skipping wait for vf pages stage
[Thu Feb 8 15:17:40 2024] mlx5_core 0000:87:00.0: mlx5_health_try_recover:382:(pid 95881): health recovery flow aborted, PCI reads still not working
[Thu Feb 8 15:17:40 2024] mlx5_core 0000:87:00.0: health_recover_work:409:(pid 95881): Health recovery failed
[Thu Feb 8 15:17:42 2024] mlx5_core 0000:87:00.1: mlx5_health_try_recover:382:(pid 96872): health recovery flow aborted, PCI reads still not working
[Thu Feb 8 15:17:42 2024] mlx5_core 0000:87:00.1: health_recover_work:409:(pid 96872): Health recovery failed
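
Every 32-bit field in the health dump above reads 0xffffffff, which together with "PCI reads still not working" suggests the reads themselves are failing rather than the firmware reporting a real assert. Below is a minimal Python sketch for spotting that pattern in a saved log. The function name all_ones_devices and the regular expression are my own illustration, written against the line format shown above; other driver versions may print the fields differently.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [...] mlx5_core 0000:87:00.0: print_health_info:465:(pid 0): hw_id 0xffffffff
HEALTH_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"print_health_info:\d+:\(pid \d+\): (?P<field>.+?) (?P<value>0x[0-9a-f]+)\b"
)


def all_ones_devices(log_text):
    """Return PCI addresses whose 32-bit health fields all read 0xffffffff."""
    values = defaultdict(list)
    for line in log_text.splitlines():
        m = HEALTH_RE.search(line)
        # Keep only 32-bit fields ("0x" plus 8 hex digits); shorter fields
        # such as synd and ext_synd are ignored in this sketch.
        if m and len(m.group("value")) == 10:
            values[m.group("bdf")].append(m.group("value"))
    return sorted(
        bdf for bdf, vals in values.items()
        if vals and all(v == "0xffffffff" for v in vals)
    )
```

It can be run on saved dmesg output, for example all_ones_devices(open("dmesg.txt").read()), where dmesg.txt is a hypothetical file name.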

flint -d 87:00.0 query

Image type: FS4
FW Version: 28.38.1002
FW Release Date: 3.8.2023
Product Version: 28.38.1002
Rom Info: type=UEFI version=14.31.20 cpu=AMD64,AARCH64
type=PXE version=3.7.201 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 946dae0300612a88 16
Base MAC: 946dae612a88 16
Image VSD: N/A
Device VSD: N/A
PSID: LNV0000000058
Security Attributes: secure-fw

Hello @wei.guo,

Thank you for posting your query on our community. You mentioned that the issue with slot 87:00 persists even after swapping the cards, which indicates that the problem is not with the card but is specific to the PCIe slot.
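
One rough way to compare the two slots while the card in 87:00 is still up is to read the PCIe link attributes that Linux exposes in sysfs (current_link_speed, max_link_speed, current_link_width, max_link_width). The Python sketch below assumes those files are present for both devices; the helper names read_link_state and link_degraded are only illustrative. A link that trained below the device maximum is a hint, not proof, of a slot, riser, or signal-integrity problem.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed", "max_link_speed",
    "current_link_width", "max_link_width",
)


def read_link_state(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the PCIe link attributes of one device from sysfs.

    Attributes that are missing or unreadable are returned as None.
    """
    dev = Path(sysfs_root) / bdf
    state = {}
    for name in LINK_ATTRS:
        try:
            state[name] = (dev / name).read_text().strip()
        except OSError:
            state[name] = None
    return state


def link_degraded(state):
    """True if the link is below the device maximum or could not be read."""
    if any(v is None for v in state.values()):
        return True
    return (
        state["current_link_speed"] != state["max_link_speed"]
        or state["current_link_width"] != state["max_link_width"]
    )


if __name__ == "__main__":
    # Compare the working slot (67:00) with the failing one (87:00).
    for bdf in ("0000:67:00.0", "0000:87:00.0"):
        state = read_link_state(bdf)
        print(bdf, state, "DEGRADED" if link_degraded(state) else "ok")
```

If 87:00.0 reports a lower speed or width than 67:00.0 with the same card model, that would be worth including when you contact the system vendor.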

Also, I notice that your adapter is showing a Lenovo PSID. I recommend reaching out to Lenovo support for further assistance.

Thanks,
Bhargavi