Problem with Mellanox ConnectX-5 (poll_health:785:(pid 0): Fatal error 1 detected)

I'm seeing the following errors in dmesg:
[3028495.729611] mlx5_core 0000:01:00.0: poll_health:785:(pid 0): Fatal error 1 detected
[3028495.729628] mlx5_core 0000:01:00.0: print_health_info:425:(pid 0): PCI slot is unavailable
[3028495.837729] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028495.837744] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028497.777471] mlx5_core 0000:01:00.1: poll_health:785:(pid 0): Fatal error 1 detected
[3028497.777534] mlx5_core 0000:01:00.1: print_health_info:425:(pid 0): PCI slot is unavailable
[3028497.777594] mlx5_core 0000:01:00.1 enp1s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028497.777664] mlx5_core 0000:01:00.1 enp1s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028497.778362] mlx5_core 0000:01:00.1: SW reset semaphore is already in use
[3028497.778417] mlx5_core 0000:01:00.1: mlx5_health_try_recover:335:(pid 3542898): handling bad device here
[3028497.778446] mlx5_core 0000:01:00.1: mlx5_error_sw_reset:231:(pid 3542898): start
[3028498.838699] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028498.838714] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028498.838736] mlx5_core 0000:01:00.1 enp1s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028498.838746] mlx5_core 0000:01:00.1 enp1s0f1np1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028499.793490] mlx5_core 0000:01:00.1: NIC IFC still 7 after 2000ms.
[3028499.793503] mlx5_core 0000:01:00.1: mlx5_error_sw_reset:268:(pid 3542898): end
[3028501.255875] mlx5_core 0000:01:00.0: mlx5_health_try_recover:335:(pid 3540626): handling bad device here
[3028501.255879] mlx5_core 0000:01:00.0: mlx5_error_sw_reset:231:(pid 3540626): start
[3028501.397387] mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[3028501.441396] mlx5_core 0000:01:00.1: mlx5_wait_for_pages:789:(pid 3542898): Skipping wait for vf pages stage
[3028501.441398] mlx5_core 0000:01:00.1: mlx5_wait_for_pages:789:(pid 3542898): Skipping wait for vf pages stage
[3028501.839845] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028501.839863] mlx5_core 0000:01:00.0 enp1s0f0np0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
[3028503.277461] mlx5_core 0000:01:00.0: NIC IFC still 7 after 2000ms.
[3028503.277470] mlx5_core 0000:01:00.0: mlx5_error_sw_reset:268:(pid 3540626): end
[3028503.869337] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[3028503.917351] mlx5_core 0000:01:00.0: mlx5_wait_for_pages:789:(pid 3540626): Skipping wait for vf pages stage
[3028503.917353] mlx5_core 0000:01:00.0: mlx5_wait_for_pages:789:(pid 3540626): Skipping wait for vf pages stage
[3028564.620227] mlx5_core 0000:01:00.1: mlx5_health_try_recover:338:(pid 3542898): health recovery flow aborted, PCI reads still not working
[3028567.148186] mlx5_core 0000:01:00.0: mlx5_health_try_recover:338:(pid 3540626): health recovery flow aborted, PCI reads still not working

After that, the interfaces are missing from Ubuntu.

This has happened a few times. The first time was one month ago; I upgraded the kernel and firmware and thought that had helped…

After a software reboot of the OS, the interfaces are visible again.

What should I check first?

From the dmesg output, it looks like the PCIe link to the card was lost. Try moving the card to another PCIe slot and see whether the issue still happens.
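
To check the "PCIe lost" reading against the log, one option is to summarize the mlx5_core messages per PCI address and then see whether the device's config space still reads back real data. The Python sketch below is only an illustration, not an official NVIDIA/Mellanox tool. The function names (summarize, config_reads_all_ones) and the script name in the usage line are made up for this example. It assumes the dmesg line layout shown in the log above, and it assumes that a device which has dropped off the bus returns all-ones (0xFF bytes) on config reads, which is the usual PCIe behaviour but is not confirmed for this particular failure.

```python
import re
import sys
from collections import defaultdict

# Matches lines such as:
# [3028495.729611] mlx5_core 0000:01:00.0: poll_health:785:(pid 0): Fatal error 1 detected
# The layout is taken from the log in this thread; other kernels may format it differently.
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r"[: ]\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Group mlx5_core messages by PCI address (hypothetical helper)."""
    summary = defaultdict(
        lambda: {"first_fatal": None, "recovery_aborted": False, "events": 0}
    )
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an mlx5_core line in the expected layout
        entry = summary[match["bdf"]]
        entry["events"] += 1
        msg = match["msg"]
        # Remember the timestamp of the first "Fatal error" per device.
        if "Fatal error" in msg and entry["first_fatal"] is None:
            entry["first_fatal"] = float(match["ts"])
        # The driver gave up because PCI reads kept failing.
        if "health recovery flow aborted" in msg:
            entry["recovery_aborted"] = True
    return dict(summary)


def config_reads_all_ones(config_bytes):
    """True if the first 4 config bytes (vendor/device ID) are all 0xFF.

    Assumption: all-ones here means the device no longer answers on the bus.
    """
    head = config_bytes[:4]
    return len(head) == 4 and head == b"\xff" * 4


if __name__ == "__main__":
    # Usage (script name is hypothetical): dmesg | python3 mlx5_summary.py
    for bdf, info in summarize(sys.stdin).items():
        print(bdf, info)
        try:
            # Reading the sysfs config file usually needs no special rights
            # for the first 64 bytes; the path assumes a Linux host.
            with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as fh:
                gone = config_reads_all_ones(fh.read(4))
            print("  config space all-ones:", gone)
        except OSError as exc:
            print("  could not read config space:", exc)
```

If both functions (01:00.0 and 01:00.1) show a fatal error within a couple of seconds of each other and the config space reads as all-ones, that points at the whole card or its link rather than a single port. It does not say why the link went away.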

I tried this. It did not resolve my issue.
