Device's health compromised: firmware internal error

Hello, guys

Please help me investigate the following issue.
I have two Dell servers with two MCX4121A-ACA_Ax cards installed. The OS is Ubuntu 22.04 LTS.
I recently upgraded the firmware to the latest version, 14.32.1010, and now I see the following error in dmesg on both servers, for both cards:

[ 14.944178] mlx5_core 0000:3b:00.0: poll_health:971:(pid 0): device's health compromised - reached miss count
[ 14.946551] mlx5_core 0000:3b:00.0: print_health_info:491:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[ 14.951182] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[0] 0x00000000
[ 14.953561] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[1] 0x000000b9
[ 14.955588] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[2] 0x00000040
[ 14.957305] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[3] 0x00000000
[ 14.958959] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[4] 0x00000000
[ 14.960531] mlx5_core 0000:3b:00.0: print_health_info:495:(pid 0): assert_var[5] 0x00000000
[ 14.962061] mlx5_core 0000:3b:00.0: print_health_info:498:(pid 0): assert_exit_ptr 0x008771e4
[ 14.963559] mlx5_core 0000:3b:00.0: print_health_info:499:(pid 0): assert_callra 0x00810ba4
[ 14.964931] mlx5_core 0000:3b:00.0: print_health_info:500:(pid 0): fw_ver 14.32.1010
[ 14.965864] mlx5_core 0000:3b:00.0: print_health_info:502:(pid 0): time 0
[ 14.966787] mlx5_core 0000:3b:00.0: print_health_info:503:(pid 0): hw_id 0x0000020b
[ 14.967690] mlx5_core 0000:3b:00.0: print_health_info:504:(pid 0): rfr 0
[ 14.968584] mlx5_core 0000:3b:00.0: print_health_info:505:(pid 0): severity 3 (ERROR)
[ 14.969454] mlx5_core 0000:3b:00.0: print_health_info:506:(pid 0): irisc_index 2
[ 14.970310] mlx5_core 0000:3b:00.0: print_health_info:507:(pid 0): synd 0x1: firmware internal error
[ 14.971166] mlx5_core 0000:3b:00.0: print_health_info:509:(pid 0): ext_synd 0x805b
[ 14.972008] mlx5_core 0000:3b:00.0: print_health_info:510:(pid 0): raw fw_ver 0xe02003f2
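
For anyone comparing dumps like this across several hosts, the health lines can be collected into one record per PCI address. The following is only a minimal sketch: `parse_health_dump` is a hypothetical helper name, and the regular expression assumes the exact `print_health_info` line layout shown above, which may differ in other kernel versions.

```python
import re

# Assumed layout, taken from the dump above:
# [ 14.964931] mlx5_core 0000:3b:00.0: print_health_info:500:(pid 0): fw_ver 14.32.1010
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<pci>[0-9a-f:.]+):\s+"
    r"print_health_info:\d+:\(pid \d+\):\s+(?P<body>.+)"
)


def parse_health_dump(text):
    """Return {pci_address: {field: value}} for print_health_info lines."""
    devices = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        body = m.group("body").strip()
        # Skip the free-text summary line; it carries no key/value pair.
        if body.startswith("Health issue observed"):
            continue
        key, _, value = body.partition(" ")
        if key == "raw":
            # "raw fw_ver 0xe02003f2" becomes the key "raw_fw_ver".
            sub, _, value = value.partition(" ")
            key = "raw_" + sub
        devices.setdefault(m.group("pci"), {})[key] = value.strip()
    return devices
```

Feeding it the output of `dmesg` gives a dictionary in which fields such as `synd`, `ext_synd` and `fw_ver` can be compared between cards.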

Here is the device hw query output:

flint -d 3b:00.1 -ocr hw query

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 523
HwRevId 0x0
Flash Info:
Type W25QxxBV
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 1
Flash0.WriteProtected Top,8-SubSectors
JEDEC_ID 0x1840ef
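
As a small cross-check, the `HwDevId 523` reported by flint (decimal) and the `hw_id 0x0000020b` in the dmesg dump (hexadecimal) are the same value. A minimal sketch of that comparison, where `same_device_id` is a hypothetical helper name:

```python
def same_device_id(flint_hw_dev_id: str, dmesg_hw_id: str) -> bool:
    """Compare flint's decimal HwDevId with the hex hw_id from dmesg."""
    return int(flint_hw_dev_id, 10) == int(dmesg_hw_id, 16)


print(same_device_id("523", "0x0000020b"))
```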

I now need to get these servers into production, but I am concerned that this error could lead to network interruptions.

Please help and advise.

Could anyone help investigate this?

Hi Amogilny
Thank you for contacting us.

From the HW query output, we do not see any issue with the HW.
The error only occurs in the HCA boot-up flow, and does not occur while the HCA is running (after boot-up).
So the issue will not affect normal operation.
If you face the same issue during normal operation, please re-install or upgrade the firmware.

Thank you,
NVIDIA Network Support
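
The distinction drawn in the reply above, between the boot-up flow and normal operation, can be checked against dmesg timestamps, which count seconds since boot. The following is a minimal sketch, not an official tool: `health_events_after` is a hypothetical helper, and the 60-second boot window is an arbitrary assumption to adjust for how long the hosts take to boot.

```python
import re

# Matches the two mlx5_core health messages seen in the dump in this thread.
HEALTH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+\S+\s+.*"
    r"(health compromised|Health issue observed)"
)


def health_events_after(text, boot_window_s=60.0):
    """Return health-error lines logged later than boot_window_s seconds after boot."""
    late = []
    for line in text.splitlines():
        m = HEALTH_RE.search(line)
        # boot_window_s is an assumed cut-off, not a documented value.
        if m and float(m.group("ts")) > boot_window_s:
            late.append(line.strip())
    return late
```

An empty result for the full `dmesg` output would be consistent with the error being limited to boot-up. Any returned lines are the case where the reply recommends re-installing or upgrading the firmware.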

Hello Mansunc
Thanks for clearing this up.

Currently we see no impact on network operation.
I just wanted to confirm that this is not an issue to worry about, as our new servers are just starting to work in production.
