One of our MCX653106A-ECA cards reports a firmware internal error and does not initialize its network interfaces. I tried reinstalling the firmware (20.43.2026) and downgrading it (20.35.4506), with no change. Comparing the MFT output against a working card, the differences are mostly in serial numbers, but also in “Enabled Link Speed” and “Supported Cable Speed”. See results below.
We are only using these cards in 40G Ethernet mode, and, being new to this, I am not sure I am using the right drivers. I installed mlnx-en-24.10-1.1.4.0-ubuntu24.04-x86_64. Would using the OFED or DOCA drivers instead make any difference?
[ 18.308254] mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[ 18.308314] mlx5_core 0000:01:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 18.308332] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[ 18.308347] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[ 18.308360] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[ 18.308373] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[ 18.308386] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[ 18.308402] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[ 18.308416] mlx5_core 0000:01:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x212041a0
[ 18.308429] mlx5_core 0000:01:00.0: print_health_info:505:(pid 0): assert_callra 0x2120ae94
[ 18.308454] mlx5_core 0000:01:00.0: print_health_info:506:(pid 0): fw_ver 20.43.2026
[ 18.308470] mlx5_core 0000:01:00.0: print_health_info:508:(pid 0): time 0
[ 18.308485] mlx5_core 0000:01:00.0: print_health_info:509:(pid 0): hw_id 0x0000020f
[ 18.308494] mlx5_core 0000:01:00.0: print_health_info:510:(pid 0): rfr 0
[ 18.308501] mlx5_core 0000:01:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[ 18.308515] mlx5_core 0000:01:00.0: print_health_info:512:(pid 0): irisc_index 6
[ 18.308535] mlx5_core 0000:01:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[ 18.308549] mlx5_core 0000:01:00.0: print_health_info:515:(pid 0): ext_synd 0x8a02
[ 18.308564] mlx5_core 0000:01:00.0: print_health_info:516:(pid 0): raw fw_ver 0x142b07ea
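As a sanity check on the health dump above, the `raw fw_ver` word appears to pack the firmware version as major (8 bits), minor (8 bits), and sub-minor (16 bits). That layout is an assumption inferred from the two values printed in the log, not taken from driver documentation, and `decode_fw_ver` is a made-up helper name. A minimal Python sketch:

```python
def decode_fw_ver(raw: int) -> str:
    """Split a 32-bit raw fw_ver word into major.minor.subminor.

    Assumed layout (inferred from the log, not from documentation):
    bits 31-24 major, bits 23-16 minor, bits 15-0 sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"


print(decode_fw_ver(0x142B07EA))  # value from the "raw fw_ver" line
print(0x0000020F)                 # hw_id from the log, printed in decimal
```

Under that assumption the word decodes to 20.43.2026, matching the `fw_ver` line, and `hw_id` 0x20f is 527 in decimal, the same value that `flint hw query` reports as `HwDevId`.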
root@linux7:~# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
root@linux7:~# mst status -v
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0 01:00.0 -1
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0.1 01:00.1 -1
root@linux7:~# mlxburn query -d /dev/mst/mt4123_pciconf0
-I- Image type: FS4
-I- FW Version: 20.35.4506
-I- FW Release Date: 22.12.2024
-I- Product Version: 20.35.4506
-I- Rom Info: type=UEFI version=14.29.15 cpu=AMD64,AARCH64
-I- type=PXE version=3.6.902 cpu=AMD64
-I- Description: UID GuidsNumber
-I- Base GUID: b83fd203007b4926 8
-I- Base MAC: b83fd27b4926 8
-I- Image VSD: N/A
-I- Device VSD: N/A
-I- PSID: MT_0000000224
-I- Security Attributes: N/A
root@linux7:~# mlxburn vpd -d /dev/mst/mt4123_pciconf0
VPD-KEYWORD DESCRIPTION VALUE
----------- ----------- -----
Read Only Section:
PN Part Number MCX653106A-ECAT
EC Revision AG
V2 N/A MCX653106A-ECAT
SN Serial Number MT2242T00FA2
V3 N/A b4069adc6c4aed118000b83fd27b4926
VA N/A MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653106A
V0 Misc Info PCIeGen4 x16
VU N/A MT2242T00FA2MLNXS0D0F0
RV Checksum Complement 0xa1
IDTAG Board Id ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), dual-port QSFP56
root@linux7:~# mget_temp -d /dev/mst/mt4123_pciconf0
53
root@linux7:~# flint -d /dev/mst/mt4123_pciconf0 hw query
HW Info:
HwDevId 527
HwRevId 0x0
Flash Info:
Type MX25Lxxx
TotalSize 0x2000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x80
CmdSet 0x80
JEDEC_ID 0x1920c2
root@linux7:~# mlxlink -d /dev/mst/mt4123_pciconf0 -p 1
Operational Info
----------------
State : Polling
Physical state : Disabled
Speed : N/A
Width : N/A
FEC : N/A
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed : 0x00000005 (QDR,SDR)
Supported Cable Speed : 0x00000007 (QDR,DDR,SDR)
Troubleshooting Info
--------------------
Status Opcode : 2
Group Opcode : PHY FW
Recommendation : Negotiation failure
Tool Information
----------------
Firmware Version : 20.35.4506
amBER Version : 3.6
MFT Version : mft 4.30.1-8
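The two masks in the `Supported Info` block above can be decoded by hand. The sketch below assumes the bit assignment SDR = bit 0, DDR = bit 1, QDR = bit 2, which is inferred only from the names mlxlink prints next to 0x5 and 0x7. The table `IB_SPEED_BITS` and the function `decode_mask` are hypothetical helpers, not part of MFT.

```python
# Bit positions inferred from the mlxlink output above (an assumption).
IB_SPEED_BITS = {0: "SDR", 1: "DDR", 2: "QDR"}


def decode_mask(mask: int, names: dict[int, str]) -> list[str]:
    """Return the names of all set bits, highest bit first.

    Bits without an entry in `names` are reported as 'bitN'.
    """
    out = []
    for bit in range(mask.bit_length() - 1, -1, -1):
        if mask & (1 << bit):
            out.append(names.get(bit, f"bit{bit}"))
    return out


print(decode_mask(0x00000005, IB_SPEED_BITS))  # Enabled Link Speed
print(decode_mask(0x00000007, IB_SPEED_BITS))  # Supported Cable Speed
```

Both masks on this card contain only InfiniBand rate names (SDR, DDR, QDR), and neither includes an Ethernet speed such as 40G.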
The working card reports the following for the same command:
root@linux7:~# mlxlink -d /dev/mst/mt4123_pciconf0 -p 1
Operational Info
----------------
State : Active
Physical state : ETH_AN_FSM_ENABLE
Speed : 40G
Width : 4x
FEC : No FEC
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x000007f2 (100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x00000030 (40G,10G)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 20.43.2026
amBER Version : 3.6
MFT Version : mft 4.30.1-8
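To compare the two cards field by field rather than by eye, the mlxlink output can be split on the first ` : ` and diffed. This is a rough sketch, assuming each field sits on its own `Name : value` line as in a normal terminal capture. `parse_mlxlink` and `diff_fields` are hypothetical helper names, and the sample strings hold only a few of the fields quoted above.

```python
def parse_mlxlink(text: str) -> dict[str, str]:
    """Collect 'Name : value' lines into a dict; headers and rules are skipped."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # only lines that actually contain the separator
            fields[name.strip()] = value.strip()
    return fields


def diff_fields(bad: dict[str, str], good: dict[str, str]) -> dict[str, tuple]:
    """Map each differing field to (failing value, working value); None means absent."""
    keys = sorted(set(bad) | set(good))
    return {k: (bad.get(k), good.get(k)) for k in keys if bad.get(k) != good.get(k)}


failing = """State : Polling
Physical state : Disabled
Speed : N/A
Status Opcode : 2
Recommendation : Negotiation failure
MFT Version : mft 4.30.1-8"""

working = """State : Active
Physical state : ETH_AN_FSM_ENABLE
Speed : 40G
Status Opcode : 0
Recommendation : No issue was observed
MFT Version : mft 4.30.1-8"""

changes = diff_fields(parse_mlxlink(failing), parse_mlxlink(working))
for field, (bad, good) in changes.items():
    print(f"{field}: failing={bad!r} working={good!r}")
```

Fields that are identical on both cards, such as `MFT Version`, drop out of the result, so only the lines worth investigating remain.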