ConnectX-6 firmware internal error

One of our MCX653106A-ECA cards reports a firmware internal error and does not initialize its network interfaces. I tried reinstalling the firmware (20.43.2026) and downgrading it (20.35.4506), with no change. Comparing the MFT output against a working card, the differences are mostly in serial numbers, but also in “enabled link speeds” and “supported cable speeds”. See the results below.
We are only using these cards in 40G Ethernet mode, and, being new to this, I am not sure I am using the right drivers. I installed mlnx-en-24.10- Would using the OFED or DOCA drivers instead make any difference?

[   18.308254] mlx5_core 0000:01:00.0: poll_health:1082:(pid 0): device's health compromised - reached miss count
[   18.308314] mlx5_core 0000:01:00.0: print_health_info:497:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[   18.308332] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[0] 0x00000000
[   18.308347] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[1] 0x00000000
[   18.308360] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[2] 0x00000000
[   18.308373] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[3] 0x00000000
[   18.308386] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[4] 0x00000000
[   18.308402] mlx5_core 0000:01:00.0: print_health_info:501:(pid 0): assert_var[5] 0x00000000
[   18.308416] mlx5_core 0000:01:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x212041a0
[   18.308429] mlx5_core 0000:01:00.0: print_health_info:505:(pid 0): assert_callra 0x2120ae94
[   18.308454] mlx5_core 0000:01:00.0: print_health_info:506:(pid 0): fw_ver 20.43.2026
[   18.308470] mlx5_core 0000:01:00.0: print_health_info:508:(pid 0): time 0
[   18.308485] mlx5_core 0000:01:00.0: print_health_info:509:(pid 0): hw_id 0x0000020f
[   18.308494] mlx5_core 0000:01:00.0: print_health_info:510:(pid 0): rfr 0
[   18.308501] mlx5_core 0000:01:00.0: print_health_info:511:(pid 0): severity 3 (ERROR)
[   18.308515] mlx5_core 0000:01:00.0: print_health_info:512:(pid 0): irisc_index 6
[   18.308535] mlx5_core 0000:01:00.0: print_health_info:513:(pid 0): synd 0x1: firmware internal error
[   18.308549] mlx5_core 0000:01:00.0: print_health_info:515:(pid 0): ext_synd 0x8a02
[   18.308564] mlx5_core 0000:01:00.0: print_health_info:516:(pid 0): raw fw_ver 0x142b07ea
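A side note on reading the dump above: the `raw fw_ver` word appears to pack the same version that the `fw_ver` line prints. The sketch below assumes a layout of 8-bit major, 8-bit minor, and 16-bit sub-minor; that layout is inferred from this log only, not taken from driver documentation, and `decode_fw_ver` is a hypothetical helper name.

```shell
# Decode a packed 32-bit firmware version word.
# Assumed layout (inferred from the log above): major[31:24] minor[23:16] subminor[15:0]
decode_fw_ver() {
    raw=$(( $1 ))
    printf '%d.%d.%d\n' \
        $(( (raw >> 24) & 0xff )) \
        $(( (raw >> 16) & 0xff )) \
        $(( raw & 0xffff ))
}

decode_fw_ver 0x142b07ea   # prints 20.43.2026
```

The same kind of conversion shows that `hw_id 0x0000020f` in the dump is decimal 527, which matches the `HwDevId` that `flint hw query` reports further down.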
root@linux7:~# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
root@linux7:~# mst status -v
MST modules:
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
DEVICE_TYPE             MST                           PCI       RDMA            NET                                     NUMA  
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      01:00.0                                           -1    

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0.1    01:00.1                                           -1    

root@linux7:~# mlxburn query -d /dev/mst/mt4123_pciconf0
-I- Image type:            FS4
-I- FW Version:            20.35.4506
-I- FW Release Date:       22.12.2024
-I- Product Version:       20.35.4506
-I- Rom Info:              type=UEFI version=14.29.15 cpu=AMD64,AARCH64
-I-                        type=PXE version=3.6.902 cpu=AMD64
-I- Description:           UID                GuidsNumber
-I- Base GUID:             b83fd203007b4926        8
-I- Base MAC:              b83fd27b4926            8
-I- Image VSD:             N/A
-I- Device VSD:            N/A
-I- PSID:                  MT_0000000224
-I- Security Attributes:   N/A
root@linux7:~# mlxburn vpd -d /dev/mst/mt4123_pciconf0

  VPD-KEYWORD    DESCRIPTION             VALUE                     
  -----------    -----------             -----                     
Read Only Section:

  PN             Part Number             MCX653106A-ECAT           
  EC             Revision                AG                        
  V2             N/A                     MCX653106A-ECAT           
  SN             Serial Number           MT2242T00FA2              
  V3             N/A                     b4069adc6c4aed118000b83fd27b4926
  VA             N/A                     MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653106A      
  V0             Misc Info               PCIeGen4 x16              
  VU             N/A                     MT2242T00FA2MLNXS0D0F0    
  RV             Checksum Complement     0xa1                      
  IDTAG          Board Id                ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), dual-port QSFP56                                                                                                    

root@linux7:~# mget_temp -d /dev/mst/mt4123_pciconf0
root@linux7:~# flint -d /dev/mst/mt4123_pciconf0 hw query
HW Info:
  HwDevId                 527
  HwRevId                 0x0
Flash Info:
  Type                    MX25Lxxx
  TotalSize               0x2000000
  Banks                   0x1
  SectorSize              0x1000
  WriteBlockSize          0x80
  CmdSet                  0x80
  JEDEC_ID                0x1920c2
root@linux7:~# mlxlink -d /dev/mst/mt4123_pciconf0 -p 1

Operational Info
State                              : Polling 
Physical state                     : Disabled 
Speed                              : N/A 
Width                              : N/A 
FEC                                : N/A 
Loopback Mode                      : No Loopback 
Auto Negotiation                   : ON 

Supported Info
Enabled Link Speed                 : 0x00000005 (QDR,SDR) 
Supported Cable Speed              : 0x00000007 (QDR,DDR,SDR) 

Troubleshooting Info
Status Opcode                      : 2 
Group Opcode                       : PHY FW 
Recommendation                     : Negotiation failure 

Tool Information
Firmware Version                   : 20.35.4506 
amBER Version                      : 3.6 
MFT Version                        : mft 4.30.1-8 

For comparison, the working card shows:

root@linux7:~# mlxlink -d /dev/mst/mt4123_pciconf0 -p 1

Operational Info
State                              : Active 
Physical state                     : ETH_AN_FSM_ENABLE 
Speed                              : 40G 
Width                              : 4x 
FEC                                : No FEC 
Loopback Mode                      : No Loopback 
Auto Negotiation                   : ON 

Supported Info
Enabled Link Speed (Ext.)          : 0x000007f2 (100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G) 
Supported Cable Speed (Ext.)       : 0x00000030 (40G,10G) 

Troubleshooting Info
Status Opcode                      : 0 
Group Opcode                       : N/A 
Recommendation                     : No issue was observed 

Tool Information
Firmware Version                   : 20.43.2026 
amBER Version                      : 3.6 
MFT Version                        : mft 4.30.1-8 

The error means that the device is configured to work with its flash in dual mode instead of quad mode, as the firmware expects.
Try changing it to quad mode.
If that doesn’t work, I suggest contacting NVIDIA support for further assistance.

Thanks, that was it.

I needed to re-install mft with --oem to get access to the flint hw set command. After enabling quad mode, the firmware booted without an error.
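For anyone landing here later, a rough sketch of what that step may look like. The exact command was not posted in this thread, so the attribute name `QuadEn` is an assumption based on the flash attributes that `flint hw set` documents; check `flint -h` on your OEM-enabled MFT install before running anything. The `run` and `enable_quad_mode` helpers are hypothetical names, and the script only prints the commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands by default, executes them only with RUN=1.
DEV=/dev/mst/mt4123_pciconf0   # device path as shown earlier in this thread

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

enable_quad_mode() {
    # Inspect the current flash parameters first.
    run flint -d "$DEV" hw query
    # ASSUMPTION: QuadEn=1 is the attribute that switches the flash to quad mode.
    run flint -d "$DEV" hw set QuadEn=1
}

enable_quad_mode
```

The thread does not say whether a reboot or power cycle was needed before the firmware came up cleanly, so plan for one.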

Then I just needed to set LINK_TYPE_P1/2 to ETH(2) for the interfaces to come up with Ethernet.
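This fits the earlier `mlxlink` output: the failing card listed InfiniBand rate names (QDR, SDR), which suggests the VPI ports were in IB mode. A minimal sketch of the port-type change with `mlxconfig` follows, assuming the same device path as above; `run` and `set_eth_link_type` are hypothetical helper names, and the script only prints the commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands by default, executes them only with RUN=1.
DEV=/dev/mst/mt4123_pciconf0   # device path as shown earlier in this thread

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

set_eth_link_type() {
    # Show the current configuration, including LINK_TYPE_P1 and LINK_TYPE_P2.
    run mlxconfig -d "$DEV" query
    # 2 = ETH, per the value given in this thread. Takes effect after a reboot.
    run mlxconfig -d "$DEV" set LINK_TYPE_P1=2 LINK_TYPE_P2=2
}

set_eth_link_type
```

When run for real, `mlxconfig` asks for confirmation before applying the change.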
