ConnectX-7 Cards stuck, irisc not responding

Hi, I have three ConnectX-7 cards. Two are installed in ASUS Pro WS WRX90E-SAGE SE mainboards (one each, in two different servers); the third is in an ASUS Pro WS W790E-SAGE SE. All three mainboards have the latest BIOS installed.
The cards (MCX755106AS-HEA_Ax) have the latest available firmware from mlxfwmanager (28.43.1014).
The link layer is set to Infiniband on all ports.
My OS is Ubuntu 24.04 on all machines.
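
For anyone who wants to double-check the link layer and port state without extra tools, here is a minimal sketch that reads the standard sysfs entries under /sys/class/infiniband. The function name read_ib_ports and the sysfs_root parameter are just illustrative choices, not part of any NVIDIA tool, and the sketch assumes the usual <device>/ports/<n>/{link_layer,state,phys_state} layout.

```python
from pathlib import Path


def read_ib_ports(sysfs_root="/sys/class/infiniband"):
    """Return one dict per RDMA port found under sysfs_root.

    Missing files are reported as None instead of raising, so the
    function also works on hosts without any RDMA device.
    """
    ports = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return ports

    for port_dir in sorted(root.glob("*/ports/*")):
        def read(name):
            entry = port_dir / name
            return entry.read_text().strip() if entry.is_file() else None

        ports.append({
            "device": port_dir.parent.parent.name,  # e.g. mlx5_0
            "port": port_dir.name,                  # e.g. 1
            "link_layer": read("link_layer"),       # e.g. InfiniBand
            "state": read("state"),                 # e.g. 1: DOWN
            "phys_state": read("phys_state"),       # e.g. 2: Polling
        })
    return ports


if __name__ == "__main__":
    for p in read_ib_ports():
        print(p["device"], p["port"], p["link_layer"], p["state"], p["phys_state"])
```

On an affected machine I would expect this to list the link layer of each port along with the logical and physical port state.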

/opt/mellanox/doca/tools/doca-info

Versions:

  • MFT 4.30.1-8
  • DOCA Base (OFED) MLNX_OFED_LINUX-24.10-1.1.4.0
  • DOCA

UEFI\ATF versions:

  • mst_device: mt41692_pciconf[0-9]
    UEFI Version: N\A
    ATF Version: N\A

/opt/mellanox/doca/tools/doca-info: line 81: BF_FW_ARR: bad array subscript
Firmware (Current):

  • 28.43.1014

SNAP3:

  • mlnx-libsnap NA (package not found)
  • mlnx-snap NA (package not found)
  • spdk NA (package not found)

DOCA:

  • doca-all 2.9.1-0.1.9
  • doca-apsh-config 2.9.1008-1
  • doca-bench 2.9.1008-1
  • doca-caps 2.9.1008-1
  • doca-comm-channel-admin 2.9.1008-1
  • doca-devel 2.9.1-0.1.9
  • doca-dms 2.9.1008-1
  • doca-extra 0.1.7-1
  • doca-flow-tune 2.9.1008-1
  • doca-host 2.9.1-018000-24.10-ubuntu2404
  • doca-ofed 2.9.1-0.1.9
  • doca-openvswitch-common 2.9.1-0013-24.11-based-3.3.3
  • doca-openvswitch-switch 2.9.1-0013-24.11-based-3.3.3
  • doca-pcc-counters 2.9.1008-1
  • doca-runtime 2.9.1-0.1.9
  • doca-samples 2.9.1008-1
  • doca-sdk-aes-gcm 2.9.1008-1
  • doca-sdk-apsh 2.9.1008-1
  • doca-sdk-argp 2.9.1008-1
  • doca-sdk-comch 2.9.1008-1
  • doca-sdk-common 2.9.1008-1
  • doca-sdk-compress 2.9.1008-1
  • doca-sdk-devemu 2.9.1008-1
  • doca-sdk-dma 2.9.1008-1
  • doca-sdk-dpa 2.9.1008-1
  • doca-sdk-dpdk-bridge 2.9.1008-1
  • doca-sdk-erasure-coding 2.9.1008-1
  • doca-sdk-eth 2.9.1008-1
  • doca-sdk-flow 2.9.1008-1
  • doca-sdk-pcc 2.9.1008-1
  • doca-sdk-rdma 2.9.1008-1
  • doca-sdk-sha 2.9.1008-1
  • doca-sdk-telemetry 2.9.1008-1
  • doca-sdk-telemetry-exporter 2.9.1008-1
  • doca-sdk-urom 2.9.1008-1
  • doca-sha-offload-engine 2.9.1008-1
  • doca-socket-relay 2.9.1008-1
  • doca-sosreport 4.8.0-1
  • dpacc 1.9.0
  • dpacc-extract 1.9.0
  • flexio-samples 24.10.2454
  • flexio-sdk 24.10.2454
  • libdoca-sdk-aes-gcm-dev 2.9.1008-1
  • libdoca-sdk-apsh-dev 2.9.1008-1
  • libdoca-sdk-argp-dev 2.9.1008-1
  • libdoca-sdk-comch-dev 2.9.1008-1
  • libdoca-sdk-common-dev 2.9.1008-1
  • libdoca-sdk-compress-dev 2.9.1008-1
  • libdoca-sdk-devemu-dev 2.9.1008-1
  • libdoca-sdk-dma-dev 2.9.1008-1
  • libdoca-sdk-dpa-dev 2.9.1008-1
  • libdoca-sdk-dpdk-bridge-dev 2.9.1008-1
  • libdoca-sdk-erasure-coding-dev 2.9.1008-1
  • libdoca-sdk-eth-dev 2.9.1008-1
  • libdoca-sdk-flow-dev 2.9.1008-1
  • libdoca-sdk-flow-trace 2.9.1008-1
  • libdoca-sdk-pcc-dev 2.9.1008-1
  • libdoca-sdk-rdma-dev 2.9.1008-1
  • libdoca-sdk-sha-dev 2.9.1008-1
  • libdoca-sdk-telemetry-dev 2.9.1008-1
  • libdoca-sdk-telemetry-exporter-dev 2.9.1008-1
  • libdoca-sdk-urom-dev 2.9.1008-1
  • python3-doca-openvswitch 2.9.1-0013-24.11-based-3.3.3

DOCA Dependencies:

  • collectx-clxapi 1.19.1
  • dpaeumgmt 24.10.2407
  • dpa-gdbserver 24.10.2454
  • dpa-stats 24.10.2407
  • flexio-sdk 24.10.2454
  • mlnx-dpdk 22.11.0-2410.1.0.2410114

OFED:

  • doca-openvswitch-common 2.9.1-0013-24.11-based-3.3.3
  • doca-openvswitch-switch 2.9.1-0013-24.11-based-3.3.3
  • dpcp 1.1.50-1.2410068
  • hcoll 4.8.3230-1.2410068
  • ibacm 2410mlnx54-1.2410068
  • ibarr:amd64 0.1.3-1.2410068
  • ibdump 6.0.0-1.2410068
  • ibsim 0.12-1.2410068
  • ibsim-doc 0.12-1.2410068
  • ibutils2 2.1.1-0.21902.MLNX20241029.g46cf6278.2410068
  • ibverbs-providers:amd64 2410mlnx54-1.2410068
  • ibverbs-utils 2410mlnx54-1.2410068
  • infiniband-diags 2410mlnx54-1.2410068
  • iser-dkms 24.10.OFED.24.10.1.1.4.1-1
  • isert-dkms 24.10.OFED.24.10.1.1.4.1-1
  • kernel-mft-dkms 4.30.1.8-1
  • knem 1.1.4.90mlnx3-OFED.23.10.0.2.1.1
  • knem-dkms 1.1.4.90mlnx3-OFED.23.10.0.2.1.1
  • libarray-intspan-perl 2.004-2
  • libibmad-dev:amd64 2410mlnx54-1.2410068
  • libibmad5:amd64 2410mlnx54-1.2410068
  • libibnetdisc5:amd64 2410mlnx54-1.2410068
  • libibnetdisc5t64:amd64 50.0-2build2
  • libibumad-dev:amd64 2410mlnx54-1.2410068
  • libibumad3:amd64 2410mlnx54-1.2410068
  • libibverbs-dev:amd64 2410mlnx54-1.2410068
  • libibverbs1:amd64 2410mlnx54-1.2410068
  • libopensm 5.21.0.MLNX20241126.d9aa3dff-0.1.2410114
  • libopensm-devel 5.21.0.MLNX20241126.d9aa3dff-0.1.2410114
  • librdmacm-dev:amd64 2410mlnx54-1.2410068
  • librdmacm1:amd64 2410mlnx54-1.2410068
  • libsharpyuv0:amd64 1.3.2-0.4build3
  • libxpmem-dev:amd64 2.7-0.2310055
  • libxpmem0:amd64 2.7-0.2310055
  • mlnx-dpdk 22.11.0-2410.1.0.2410114
  • mlnx-dpdk-dev:amd64 22.11.0-2410.1.0.2410114
  • mlnx-ethtool 6.9-1.2410068
  • mlnx-iproute2 6.10.0-1.2410114
  • mlnx-ofed-kernel-dkms 24.10.OFED.24.10.1.1.4.1-1
  • mlnx-ofed-kernel-utils 24.10.OFED.24.10.1.1.4.1-1
  • mlnx-tools 24.10-0.2410068
  • mpitests 3.2.24-2ffc2d6.2410068
  • openmpi 4.1.7rc1-1.2410068
  • opensm 5.21.0.MLNX20241126.d9aa3dff-0.1.2410114
  • opensm-doc 5.21.0.MLNX20241126.d9aa3dff-0.1.2410114
  • perftest 24.10.0-0.65.g9093bae.2410068
  • rdma-core 2410mlnx54-1.2410068
  • rdmacm-utils 2410mlnx54-1.2410068
  • rshim 2.1.8-0.g5e3709e.2410114
  • sharp 3.9.0.MLNX20241029.7a20b607-1.2410068
  • srp-dkms 24.10.OFED.24.10.1.1.4.1-1
  • srptools 2410mlnx54-1.2410068
  • ucx 1.18.0-1.2410068
  • xpmem 2.7.4-1.2410068
  • xpmem-dkms 2.7.4-1.2410068

When I plug in a cable, the respective ports get stuck in Polling (State: Down) and the following errors appear:
[ 3.936667] mlx_compat: loading out-of-tree module taints kernel.
[ 3.937711] mlx_compat: module verification failed: signature and/or required key missing - tainting kernel
[ 5.102875] mlx5_core 0000:34:00.0: firmware version: 28.43.1014
[ 5.103750] mlx5_core 0000:34:00.0: 504.112 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x16 link)
[ 5.425141] mlx5_core 0000:34:00.0: Port module event: module 0, Cable unplugged
[ 5.426223] mlx5_core 0000:34:00.0: mlx5_pcie_event:295:(pid 387): PCIe slot power capability was not advertised.
[ 5.435515] mlx5_core 0000:34:00.1: firmware version: 28.43.1014
[ 5.436364] mlx5_core 0000:34:00.1: 504.112 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x16 link)
[ 5.761570] mlx5_core 0000:34:00.1: Port module event: module 1, Cable unplugged
[ 5.762658] mlx5_core 0000:34:00.1: mlx5_pcie_event:295:(pid 386): PCIe slot power capability was not advertised.
[ 10.216247] mlx5_core 0000:34:00.0 ibs8191f0: renamed from ib0
[ 10.270973] mlx5_core 0000:34:00.1 ibs8191f1: renamed from ib0
[ 222.502898] mlx5_core 0000:34:00.0: Port module event: module 0, Cable plugged
[ 222.503472] mlx5_core 0000:34:00.0: mlx5_pcie_event:304:(pid 388): PCIe slot advertised sufficient power (44W).
[ 222.503481] mlx5_core 0000:34:00.1: mlx5_pcie_event:304:(pid 385): PCIe slot advertised sufficient power (44W).
[ 232.717491] mlx5_core 0000:34:00.1: poll_health:1082:(pid 0): device’s health compromised - reached miss count
[ 232.717531] mlx5_core 0000:34:00.1: print_health_info:497:(pid 0): Health issue observed, irisc not responding, severity(3) ERROR:
[ 232.717547] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[0] 0x00000001
[ 232.717559] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[1] 0x21283c2c
[ 232.717570] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 232.717581] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 232.717592] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 232.717603] mlx5_core 0000:34:00.1: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 232.717614] mlx5_core 0000:34:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x2154a710
[ 232.717626] mlx5_core 0000:34:00.1: print_health_info:505:(pid 0): assert_callra 0x21458058
[ 232.717644] mlx5_core 0000:34:00.1: print_health_info:506:(pid 0): fw_ver 28.43.1014
[ 232.717658] mlx5_core 0000:34:00.1: print_health_info:508:(pid 0): time 1738157961
[ 232.717671] mlx5_core 0000:34:00.1: print_health_info:509:(pid 0): hw_id 0x00000218
[ 232.717680] mlx5_core 0000:34:00.1: print_health_info:510:(pid 0): rfr 0
[ 232.717689] mlx5_core 0000:34:00.1: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 232.717701] mlx5_core 0000:34:00.1: print_health_info:512:(pid 0): irisc_index 9
[ 232.717715] mlx5_core 0000:34:00.1: print_health_info:513:(pid 0): synd 0x7: irisc not responding
[ 232.717728] mlx5_core 0000:34:00.1: print_health_info:515:(pid 0): ext_synd 0x4150
[ 232.717740] mlx5_core 0000:34:00.1: print_health_info:516:(pid 0): raw fw_ver 0x1c2b03f6
[ 232.845491] mlx5_core 0000:34:00.0: poll_health:1082:(pid 0): device’s health compromised - reached miss count
[ 232.845527] mlx5_core 0000:34:00.0: print_health_info:497:(pid 0): Health issue observed, irisc not responding, severity(3) ERROR:
[ 232.845543] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000001
[ 232.845555] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[1] 0x21283c2c
[ 232.845566] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 232.845577] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 232.845588] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 232.845599] mlx5_core 0000:34:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 232.845612] mlx5_core 0000:34:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x2154a710
[ 232.845624] mlx5_core 0000:34:00.0: print_health_info:505:(pid 0): assert_callra 0x21458058
[ 232.845639] mlx5_core 0000:34:00.0: print_health_info:506:(pid 0): fw_ver 28.43.1014
[ 232.845653] mlx5_core 0000:34:00.0: print_health_info:508:(pid 0): time 1738157961
[ 232.845666] mlx5_core 0000:34:00.0: print_health_info:509:(pid 0): hw_id 0x00000218
[ 232.845676] mlx5_core 0000:34:00.0: print_health_info:510:(pid 0): rfr 0
[ 232.845685] mlx5_core 0000:34:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 232.845697] mlx5_core 0000:34:00.0: print_health_info:512:(pid 0): irisc_index 9
[ 232.845711] mlx5_core 0000:34:00.0: print_health_info:513:(pid 0): synd 0x7: irisc not responding
[ 232.845724] mlx5_core 0000:34:00.0: print_health_info:515:(pid 0): ext_synd 0x4150
[ 232.845736] mlx5_core 0000:34:00.0: print_health_info:516:(pid 0): raw fw_ver 0x1c2b03f6
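
To make the health dump easier to compare across machines, here is a small sketch that groups the print_health_info fields by PCI address and decodes the raw fw_ver word. The helper names parse_health and decode_raw_fw_ver are my own, and the 8/8/16-bit split of the raw value is an assumption inferred from the two log lines above (fw_ver 28.43.1014 next to raw fw_ver 0x1c2b03f6), not taken from documentation.

```python
import re

# A few lines copied from the dmesg output above.
SAMPLE = """\
[ 232.717531] mlx5_core 0000:34:00.1: print_health_info:497:(pid 0): Health issue observed, irisc not responding, severity(3) ERROR:
[ 232.717701] mlx5_core 0000:34:00.1: print_health_info:512:(pid 0): irisc_index 9
[ 232.717715] mlx5_core 0000:34:00.1: print_health_info:513:(pid 0): synd 0x7: irisc not responding
[ 232.717728] mlx5_core 0000:34:00.1: print_health_info:515:(pid 0): ext_synd 0x4150
[ 232.717740] mlx5_core 0000:34:00.1: print_health_info:516:(pid 0): raw fw_ver 0x1c2b03f6
"""

# PCI address, then the "key value" pair printed by print_health_info.
HEALTH_LINE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"print_health_info:\d+:\(pid \d+\): "
    r"(?P<key>(?:raw )?[\w\[\]]+) (?P<value>.+)$"
)


def parse_health(dmesg_text):
    """Return {pci_address: {field: value}} for all health dump lines."""
    devices = {}
    for line in dmesg_text.splitlines():
        match = HEALTH_LINE.search(line.strip())
        if not match or match.group("key") == "Health":
            continue  # skip unrelated lines and the summary headline
        fields = devices.setdefault(match.group("bdf"), {})
        fields[match.group("key")] = match.group("value")
    return devices


def decode_raw_fw_ver(raw):
    """Split the 32-bit raw word into major.minor.sub (assumed 8/8/16 bits)."""
    return f"{raw >> 24}.{(raw >> 16) & 0xFF}.{raw & 0xFFFF}"


if __name__ == "__main__":
    for bdf, fields in parse_health(SAMPLE).items():
        print(bdf, fields["synd"], "ext_synd", fields["ext_synd"])
        print(bdf, "fw", decode_raw_fw_ver(int(fields["raw fw_ver"], 16)))
```

With the values above, the decoded raw word matches the reported firmware version 28.43.1014, and both functions of the card report the same syndrome (0x7), ext_synd (0x4150) and irisc_index (9).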

This happens on all machines. Once it happens, rebooting the machine gets stuck at POST code 92 or 99, which indicates issues with the PCIe device. Only power cycling the machine allows a correct boot.

Any help is greatly appreciated.

Hello ~
Were there any OS-related issues, such as a CPU stuck/hang, or any other weird log entries just before this?
You need to open a technical case for this.

/HyungKwang

Hi HyungKwang, thank you for your answer. No, nothing uncommon or surprising in the log right before.
I wanted to open a ticket on https://nvid.nvidia.com/ but I cannot continue after registration, because it redirects me to a page displaying “not a valid user”.
Do you have a link to where I should open the ticket?

Thanks and best
Alex

Hi

You can open a case at the NVIDIA Enterprise Support Portal
(EnterpriseSupport).
Once you have opened a case, please collect and upload the sysinfo-snapshot file
(run on the problematic server: # sysinfo-snapshot.py).

If you do not have an account on the NVIDIA Enterprise Support Portal to open a case, or if your HCA does not have a valid service entitlement,
you should contact the reseller you purchased the adapter from.

/HyungKwang