Synd 0x8: unrecoverable hardware error

Recently, we’ve had multiple nodes in the cluster (20) report errors on boot:

[root@worker5582 ~]# dmesg | grep -i mlx
[ 7.501413] mlx5_core 0000:41:00.0: firmware version: 22.42.1000
[ 7.501442] mlx5_core 0000:41:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 7.785265] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
[ 7.785517] mlx5_core 0000:41:00.0: mlx5_pcie_event:303:(pid 10): PCIe slot advertised sufficient power (75W).
[ 7.812268] mlx5_core 0000:63:00.0: firmware version: 16.35.3006
[ 7.812302] mlx5_core 0000:63:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 8.235961] mlx5_core 0000:63:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 8.240874] mlx5_core 0000:63:00.0: Port module event: module 0, Cable plugged
[ 8.241132] mlx5_core 0000:63:00.0: mlx5_pcie_event:303:(pid 10): PCIe slot advertised sufficient power (27W).
[ 8.282268] mlx5_core 0000:63:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 8.500033] mlx5_core 0000:63:00.1: firmware version: 16.35.3006
[ 8.500068] mlx5_core 0000:63:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 8.946633] mlx5_core 0000:63:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 8.951695] mlx5_core 0000:63:00.1: Port module event: module 1, Cable unplugged
[ 8.951956] mlx5_core 0000:63:00.1: mlx5_pcie_event:303:(pid 10): PCIe slot advertised sufficient power (27W).
[ 8.993130] mlx5_core 0000:63:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 9.206327] mlx5_core 0000:63:00.0 eno33: renamed from eth0
[ 9.242605] mlx5_core 0000:63:00.1 eno34: renamed from eth1
[ 9.480130] mlx5_core 0000:41:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 9.480136] mlx5_core 0000:41:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 14.795170] mlx5_core 0000:63:00.0 eno33: Link up
[ 17.982092] mlx5_core 0000:41:00.0: poll_health:852:(pid 0): device's health compromised - reached miss count
[ 17.982118] mlx5_core 0000:41:00.0: print_health_info:442:(pid 0): Health issue observed, unrecoverable hardware error, severity(3) ERROR:
[ 17.982127] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[0] 0x00010000
[ 17.982136] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[1] 0x001af17c
[ 17.982145] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[2] 0x00000000
[ 17.982151] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[3] 0x00000000
[ 17.982157] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[4] 0x00000000
[ 17.982162] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[5] 0x00000000
[ 17.982167] mlx5_core 0000:41:00.0: print_health_info:448:(pid 0): assert_exit_ptr 0x20820414
[ 17.982173] mlx5_core 0000:41:00.0: print_health_info:449:(pid 0): assert_callra 0x208207b4
[ 17.982183] mlx5_core 0000:41:00.0: print_health_info:451:(pid 0): fw_ver 22.42.1000
[ 17.982189] mlx5_core 0000:41:00.0: print_health_info:452:(pid 0): time 0
[ 17.982195] mlx5_core 0000:41:00.0: print_health_info:453:(pid 0): hw_id 0x00000212
[ 17.982198] mlx5_core 0000:41:00.0: print_health_info:454:(pid 0): rfr 0
[ 17.982201] mlx5_core 0000:41:00.0: print_health_info:455:(pid 0): severity 3 (ERROR)
[ 17.982208] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): irisc_index 9
[ 17.982217] mlx5_core 0000:41:00.0: print_health_info:458:(pid 0): synd 0x8: unrecoverable hardware error
[ 17.982223] mlx5_core 0000:41:00.0: print_health_info:459:(pid 0): ext_synd 0x0079
[ 17.982232] mlx5_core 0000:41:00.0: print_health_info:460:(pid 0): raw fw_ver 0x162a03e8
[ 254.086172] mlx5_core 0000:63:00.0 eno33: Link up
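
Since this shows up on several nodes, it may help to compare the health fields (synd, ext_synd, assert_var, irisc_index) between them. Below is a minimal sketch, assuming the `print_health_info` line format shown in the log above; the regex and the `parse_health` name are my own and are not part of any mlx5 tooling.

```python
import re

# Matches lines such as:
#   mlx5_core 0000:41:00.0: print_health_info:459:(pid 0): ext_synd 0x0079
# The pattern is an assumption based on the dmesg output above.
HEALTH_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f:.]+): print_health_info:\d+:\(pid \d+\): "
    r"(?P<key>assert_var\[\d\]|[a-z_]+(?: fw_ver)?) (?P<value>.+)$"
)


def parse_health(dmesg_text):
    """Return {pci_address: {field: value}} for print_health_info lines."""
    result = {}
    for line in dmesg_text.splitlines():
        m = HEALTH_RE.search(line)
        if m:
            result.setdefault(m["pci"], {})[m["key"]] = m["value"].strip()
    return result


# A few lines copied from the dmesg output above.
sample = """\
[ 17.982118] mlx5_core 0000:41:00.0: print_health_info:442:(pid 0): Health issue observed, unrecoverable hardware error, severity(3) ERROR:
[ 17.982136] mlx5_core 0000:41:00.0: print_health_info:446:(pid 0): assert_var[1] 0x001af17c
[ 17.982208] mlx5_core 0000:41:00.0: print_health_info:456:(pid 0): irisc_index 9
[ 17.982217] mlx5_core 0000:41:00.0: print_health_info:458:(pid 0): synd 0x8: unrecoverable hardware error
[ 17.982223] mlx5_core 0000:41:00.0: print_health_info:459:(pid 0): ext_synd 0x0079
[ 17.982232] mlx5_core 0000:41:00.0: print_health_info:460:(pid 0): raw fw_ver 0x162a03e8
"""

health = parse_health(sample)
for pci, fields in health.items():
    print(pci, fields)
```

Running the same function over `dmesg` output saved from each affected node would show whether they all report the same ext_synd and assert_var values.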

I’ve already tried updating the firmware, and it should be current.

[root@worker5582 ~]# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 22.42.1000
node_guid: 1070:fd03:00f1:6ebc
sys_image_guid: 1070:fd03:00f1:6ebc
vendor_id: 0x02c9
vendor_part_id: 4125
hw_ver: 0x0
board_id: MT_0000000903
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 65535
port_lmc: 0x00
link_layer: InfiniBand

hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.35.3006
node_guid: 1070:fd03:0056:c45c
sys_image_guid: 1070:fd03:0056:c45c
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: DEL0000000016
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.35.3006
node_guid: 1070:fd03:0056:c45d
sys_image_guid: 1070:fd03:0056:c45c
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: DEL0000000016
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

[root@worker5582 ~]#
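
As a cross-check that the failing card is really running the firmware I flashed, the `raw fw_ver 0x162a03e8` value from the health dump can be decoded by hand. This is a small sketch that assumes the raw word is packed as an 8-bit major, 8-bit minor, and 16-bit sub-minor version; that layout is inferred from this log and not taken from documentation, and `decode_fw_ver` is a hypothetical helper name.

```python
def decode_fw_ver(raw: int) -> str:
    """Split a 32-bit raw fw_ver word into major.minor.subminor.

    Assumed layout (inferred from the log, not from documentation):
    bits 31-24 major, bits 23-16 minor, bits 15-0 sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"


print(decode_fw_ver(0x162A03E8))  # prints 22.42.1000
```

Under that assumption the result matches the `fw_ver: 22.42.1000` that `ibv_devinfo` reports for mlx5_0, so the card appears to be running the updated firmware when it reports the error.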

Any ideas where to look? I’ve searched the forums, and most of what I’m seeing is firmware-related errors where a firmware update was the solution. I’ve already run mlxup, applied the updates it found, and rebooted, but I get the same error.

Thanks for any pointers.