Mlx5_core device's health compromised - reached miss count on latest firmware (14.32.1900)

Hi,

I updated my ConnectX-4 Lx to the latest available firmware, and now when I boot I get very unhappy messages in dmesg. What do they mean? I cannot use VPP/DPDK/XDP; any of them crashes the system.

# mlxfwmanager --query
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4LX
  Part Number:      MCX4121A-ACA_Ax
  Description:      ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; ROHS R6
  PSID:             MT_2420110034
  PCI Device Name:  0000:01:00.0
  Base MAC:         506b4b297f7c
  Versions:         Current        Available     
     FW             14.32.1900     14.32.1900    
     PXE            3.6.0502       3.6.0502      
     UEFI           14.25.0017     14.25.0017    

  Status:           Up to date

$ dmesg
...
[    9.960131] mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[    9.960151] mlx5_core 0000:01:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[    9.960158] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[    9.960163] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[1] 0xbadc0ffe
[    9.960169] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[    9.960174] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[    9.960179] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[    9.960184] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[    9.960189] mlx5_core 0000:01:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x00874eac
[    9.960195] mlx5_core 0000:01:00.0: print_health_info:505:(pid 0): assert_callra 0x00876e08
[    9.960203] mlx5_core 0000:01:00.0: print_health_info:506:(pid 0): fw_ver 14.32.1900
[    9.960209] mlx5_core 0000:01:00.0: print_health_info:508:(pid 0): time 1745607553
[    9.960214] mlx5_core 0000:01:00.0: print_health_info:509:(pid 0): hw_id 0x0000020b
[    9.960217] mlx5_core 0000:01:00.0: print_health_info:510:(pid 0): rfr 0
[    9.960219] mlx5_core 0000:01:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[    9.960225] mlx5_core 0000:01:00.0: print_health_info:512:(pid 0): irisc_index 2
[    9.960232] mlx5_core 0000:01:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[    9.960237] mlx5_core 0000:01:00.0: print_health_info:515:(pid 0): ext_synd 0x8a47
[    9.960243] mlx5_core 0000:01:00.0: print_health_info:516:(pid 0): raw fw_ver 0xe020076c
[   10.344098] mlx5_core 0000:01:00.1: poll_health:1082:(pid 0): device's health compromised - reached miss count
[   10.344117] mlx5_core 0000:01:00.1: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[   10.344126] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[   10.344132] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[1] 0xbadc0ffe
[   10.344138] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[   10.344144] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[   10.344150] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[   10.344155] mlx5_core 0000:01:00.1: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[   10.344160] mlx5_core 0000:01:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x00874eac
[   10.344166] mlx5_core 0000:01:00.1: print_health_info:505:(pid 0): assert_callra 0x00876e08
[   10.344175] mlx5_core 0000:01:00.1: print_health_info:506:(pid 0): fw_ver 14.32.1900
[   10.344181] mlx5_core 0000:01:00.1: print_health_info:508:(pid 0): time 1745607553
[   10.344187] mlx5_core 0000:01:00.1: print_health_info:509:(pid 0): hw_id 0x0000020b
[   10.344191] mlx5_core 0000:01:00.1: print_health_info:510:(pid 0): rfr 0
[   10.344195] mlx5_core 0000:01:00.1: print_health_info:511:(pid 0): severity 3 (ERROR)
[   10.344201] mlx5_core 0000:01:00.1: print_health_info:512:(pid 0): irisc_index 2
[   10.344210] mlx5_core 0000:01:00.1: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[   10.344216] mlx5_core 0000:01:00.1: print_health_info:515:(pid 0): ext_synd 0x8a47
[   10.344222] mlx5_core 0000:01:00.1: print_health_info:516:(pid 0): raw fw_ver 0xe020076c

# flint -d 01:00.0 -ocr hw query

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
  HwDevId                 523
  HwRevId                 0x0
Flash Info:
  Type                    W25QxxBV
  TotalSize               0x1000000
  Banks                   0x1
  SectorSize              0x1000
  WriteBlockSize          0x10
  CmdSet                  0x80
  QuadEn                  1
  Flash0.WriteProtected   Disabled
  JEDEC_ID                0x1840ef
  TBS, BP[3:0]            0, 0000

Maybe you can update your OFED version to try to solve this problem.
This kind of error is mostly caused by the FW version not matching the driver version.
Your most important error code is ext_synd 0x8a47, but we can’t get its meaning unless some documentation is made available to us.

Hmm, what documents do you need? I installed from the latest OFED release, MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu23.10-x86_64.

Edit: I followed your clue about a FW mismatch and found that the UEFI version does not match the FW version. Is it supposed to?

root@snap:~# mlxfwmanager 
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4LX
  Part Number:      MCX4121A-ACA_Ax
  Description:      ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; ROHS R6
  PSID:             MT_2420110034
  PCI Device Name:  /dev/mst/mt4117_pciconf0
  Base MAC:         506b4b297f7c
  Versions:         Current        Available     
     FW             14.32.1900     N/A           
     PXE            3.6.0502       N/A           
     UEFI           14.25.0017     N/A           

  Status:           No matching image found

The driver version reported is 25.01-0.6.0:

# modinfo mlx5_core
filename:       /lib/modules/6.8.12-9-pve/updates/dkms/mlx5_core.ko
alias:          auxiliary:mlx5_core.eth-rep
alias:          auxiliary:mlx5_core.eth
basedon:        Korg 6.12-rc2
version:        25.01-0.6.0

I updated from MLNX_OFED to the newest DOCA-HOST, yet the problem persists. Is the latest FW not compatible with the latest drivers in DOCA-HOST?

# apt list --installed | grep doca

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

doca-caps/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-flow-tune/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-host/now 2.10.0-093000-25.01-debian125 amd64 [installed,local]
doca-networking-devel/DOCA-HOST-2.10.0,now 2.10.0-0.5.3 amd64 [installed,automatic]
doca-networking-runtime/DOCA-HOST-2.10.0,now 2.10.0-0.5.3 amd64 [installed,automatic]
doca-networking/DOCA-HOST-2.10.0,now 2.10.0-0.5.3 amd64 [installed]
doca-ofed/DOCA-HOST-2.10.0,now 2.10.0-0.5.3 amd64 [installed,automatic]
doca-openvswitch-common/DOCA-HOST-2.10.0,now 2.10.0-0056-25.01-based-3.3.4 amd64 [installed,automatic]
doca-openvswitch-switch/DOCA-HOST-2.10.0,now 2.10.0-0056-25.01-based-3.3.4 amd64 [installed,automatic]
doca-samples/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sdk-argp/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sdk-common/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sdk-dpdk-bridge/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sdk-flow/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sdk-telemetry/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
doca-sosreport/DOCA-HOST-2.10.0,now 4.8.1 amd64 [installed,automatic]
libdoca-sdk-argp-dev/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
libdoca-sdk-common-dev/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
libdoca-sdk-dpdk-bridge-dev/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
libdoca-sdk-flow-dev/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
libdoca-sdk-flow-trace/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
libdoca-sdk-telemetry-dev/DOCA-HOST-2.10.0,now 2.10.0087-1 amd64 [installed,automatic]
python3-doca-openvswitch/DOCA-HOST-2.10.0,now 2.10.0-0056-25.01-based-3.3.4 amd64 [installed,automatic]

As this document (connectx4lxfirmwarev14321900/firmware+compatible) describes, maybe you can test OFED 5.5 or 5.4, which were published in 2021.

Hi David, what was your OFED version when you first reported this issue?

As stated in my previous post, the most recent MLNX_OFED release.

Perhaps LTS releases are not compatible with recent FW?

It’s a normal business choice, just like the latest iOS not supporting the iPhone 4.
Maybe the CX4-Lx has reached EOL (end of life).
According to the official document, the newest FW still supports the old OFED, which is lucky for us.

Is OFED version 5.4 with the latest firmware available to you?
Looking forward to your experiments and reply.

Hello @david.southwick,

Thank you for posting your query on our community. The error message likely indicates that the firmware got stuck during initialization and failed the health watchdog check. Could you please try a power cycle of the server to check if it resolves the issue? If not, please try re-burning the FW image onto the card.

Thanks,
Bhargavi
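
To confirm whether a power cycle or re-burn actually cleared the condition on both ports, the kernel log can be scanned for the "health compromised" line per PCI function. A minimal sketch, assuming the message format stays the same as in the log posted in this thread; the function name is hypothetical.

```python
import re

# Captures the PCI address from lines such as:
#   "mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count"
HEALTH_RE = re.compile(
    r"mlx5_core (\S+): poll_health:\d+:\(pid \d+\): device's health compromised"
)


def devices_with_health_errors(dmesg_text: str) -> list[str]:
    """Return the sorted PCI addresses that logged a health-compromised line."""
    return sorted(set(HEALTH_RE.findall(dmesg_text)))


# Two lines copied from this thread plus one unrelated health-dump line.
SAMPLE = """\
[    9.960131] mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[    9.960203] mlx5_core 0000:01:00.0: print_health_info:506:(pid 0): fw_ver 14.32.1900
[   10.344098] mlx5_core 0000:01:00.1: poll_health:1082:(pid 0): device's health compromised - reached miss count
"""

print(devices_with_health_errors(SAMPLE))  # ['0000:01:00.0', '0000:01:00.1']
```

On a live system the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` (this may need root); an empty list after boot would suggest the firmware came up cleanly.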

Thanks, SeekerV. Unfortunately, Nvidia no longer offers 5.4 or 5.5, the versions listed in your document. The closest is 5.8 (Linux InfiniBand Drivers). That being said, after a few power cycles the error has disappeared (using the DOCA drivers).

I really wish Nvidia would quit renaming things and quit moving/breaking links so that nothing is ever in the same place.
