Cards report errors during boot after firmware update

Hi,

I updated the firmware on 4 cards in my system with mlxup. Since then, I see the errors below while Ubuntu is booting. Should I be worried about them?

thanks

[ 38.668933] mlx5_core 0000:81:00.1: poll_health:739:(pid 0): device’s health compromised - reached miss count
[ 38.669777] mlx5_core 0000:81:00.1: print_health_info:386:(pid 0): assert_var[0] 0x00000000
[ 38.670515] mlx5_core 0000:81:00.1: print_health_info:386:(pid 0): assert_var[1] 0xbadc0ffe
[ 38.671247] mlx5_core 0000:81:00.1: print_health_info:386:(pid 0): assert_var[2] 0x00000000
[ 38.671962] mlx5_core 0000:81:00.1: print_health_info:386:(pid 0): assert_var[3] 0x00000000
[ 38.672662] mlx5_core 0000:81:00.1: print_health_info:386:(pid 0): assert_var[4] 0x00000000
[ 38.673343] mlx5_core 0000:81:00.1: print_health_info:389:(pid 0): assert_exit_ptr 0x0088a288
[ 38.674008] mlx5_core 0000:81:00.1: print_health_info:391:(pid 0): assert_callra 0x0088c410
[ 38.674667] mlx5_core 0000:81:00.1: print_health_info:394:(pid 0): fw_ver 16.35.4030
[ 38.675304] mlx5_core 0000:81:00.1: print_health_info:395:(pid 0): hw_id 0x0000020d
[ 38.675940] mlx5_core 0000:81:00.1: print_health_info:396:(pid 0): irisc_index 8
[ 38.676575] mlx5_core 0000:81:00.1: print_health_info:397:(pid 0): synd 0x1: firmware internal error
[ 38.677209] mlx5_core 0000:81:00.1: print_health_info:399:(pid 0): ext_synd 0x8a47
[ 38.677842] mlx5_core 0000:81:00.1: print_health_info:401:(pid 0): raw fw_ver 0x10230fbe
[ 39.180928] mlx5_core 0000:82:00.0: poll_health:739:(pid 0): device’s health compromised - reached miss count
[ 39.181608] mlx5_core 0000:82:00.0: print_health_info:386:(pid 0): assert_var[0] 0x00000000
[ 39.182261] mlx5_core 0000:82:00.0: print_health_info:386:(pid 0): assert_var[1] 0xbadc0ffe
[ 39.182903] mlx5_core 0000:82:00.0: print_health_info:386:(pid 0): assert_var[2] 0x00000000
[ 39.183543] mlx5_core 0000:82:00.0: print_health_info:386:(pid 0): assert_var[3] 0x00000000
[ 39.184171] mlx5_core 0000:82:00.0: print_health_info:386:(pid 0): assert_var[4] 0x00000000
[ 39.184787] mlx5_core 0000:82:00.0: print_health_info:389:(pid 0): assert_exit_ptr 0x0088a288
[ 39.185398] mlx5_core 0000:82:00.0: print_health_info:391:(pid 0): assert_callra 0x0088c410
[ 39.186014] mlx5_core 0000:82:00.0: print_health_info:394:(pid 0): fw_ver 16.35.4030
[ 39.186612] mlx5_core 0000:82:00.0: print_health_info:395:(pid 0): hw_id 0x0000020d
[ 39.187205] mlx5_core 0000:82:00.0: print_health_info:396:(pid 0): irisc_index 8
[ 39.187794] mlx5_core 0000:82:00.0: print_health_info:397:(pid 0): synd 0x1: firmware internal error
[ 39.188378] mlx5_core 0000:82:00.0: print_health_info:399:(pid 0): ext_synd 0x8a47
[ 39.188959] mlx5_core 0000:82:00.0: print_health_info:401:(pid 0): raw fw_ver 0x10230fbe
[ 39.189562] mlx5_core 0000:81:00.0: poll_health:739:(pid 0): device’s health compromised - reached miss count
[ 39.190156] mlx5_core 0000:81:00.0: print_health_info:386:(pid 0): assert_var[0] 0x00000000
[ 39.190747] mlx5_core 0000:81:00.0: print_health_info:386:(pid 0): assert_var[1] 0xbadc0ffe
[ 39.191332] mlx5_core 0000:81:00.0: print_health_info:386:(pid 0): assert_var[2] 0x00000000
[ 39.191911] mlx5_core 0000:81:00.0: print_health_info:386:(pid 0): assert_var[3] 0x00000000
[ 39.192486] mlx5_core 0000:81:00.0: print_health_info:386:(pid 0): assert_var[4] 0x00000000
[ 39.193050] mlx5_core 0000:81:00.0: print_health_info:389:(pid 0): assert_exit_ptr 0x0088a288
[ 39.193608] mlx5_core 0000:81:00.0: print_health_info:391:(pid 0): assert_callra 0x0088c410
[ 39.194162] mlx5_core 0000:81:00.0: print_health_info:394:(pid 0): fw_ver 16.35.4030
[ 39.194685] mlx5_core 0000:81:00.0: print_health_info:395:(pid 0): hw_id 0x0000020d
[ 39.195182] mlx5_core 0000:81:00.0: print_health_info:396:(pid 0): irisc_index 8
[ 39.195675] mlx5_core 0000:81:00.0: print_health_info:397:(pid 0): synd 0x1: firmware internal error
[ 39.196166] mlx5_core 0000:81:00.0: print_health_info:399:(pid 0): ext_synd 0x8a47
[ 39.196652] mlx5_core 0000:81:00.0: print_health_info:401:(pid 0): raw fw_ver 0x10230fbe
[ 40.716927] mlx5_core 0000:82:00.1: poll_health:739:(pid 0): device’s health compromised - reached miss count
[ 40.717458] mlx5_core 0000:82:00.1: print_health_info:386:(pid 0): assert_var[0] 0x00000000
[ 40.717963] mlx5_core 0000:82:00.1: print_health_info:386:(pid 0): assert_var[1] 0xbadc0ffe
[ 40.718457] mlx5_core 0000:82:00.1: print_health_info:386:(pid 0): assert_var[2] 0x00000000
[ 40.718948] mlx5_core 0000:82:00.1: print_health_info:386:(pid 0): assert_var[3] 0x00000000
[ 40.719432] mlx5_core 0000:82:00.1: print_health_info:386:(pid 0): assert_var[4] 0x00000000
[ 40.719892] mlx5_core 0000:82:00.1: print_health_info:389:(pid 0): assert_exit_ptr 0x0088a288
[ 40.720340] mlx5_core 0000:82:00.1: print_health_info:391:(pid 0): assert_callra 0x0088c410
[ 40.720797] mlx5_core 0000:82:00.1: print_health_info:394:(pid 0): fw_ver 16.35.4030
[ 40.721249] mlx5_core 0000:82:00.1: print_health_info:395:(pid 0): hw_id 0x0000020d
[ 40.721706] mlx5_core 0000:82:00.1: print_health_info:396:(pid 0): irisc_index 8
[ 40.722169] mlx5_core 0000:82:00.1: print_health_info:397:(pid 0): synd 0x1: firmware internal error
[ 40.722637] mlx5_core 0000:82:00.1: print_health_info:399:(pid 0): ext_synd 0x8a47
[ 40.723104] mlx5_core 0000:82:00.1: print_health_info:401:(pid 0): raw fw_ver 0x10230fbe

The log means the firmware is stuck and missed the health watchdog counter. Power cycle the server and see whether it still happens. If it does, please try re-burning the firmware with flint from MFT.
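
For anyone reading a similar dump, here is a minimal Python sketch that groups the `print_health_info` lines by PCI address and decodes the `raw fw_ver` word. The function names `summarize_health` and `decode_raw_fw_ver` are hypothetical, and the byte layout (major in the top byte, minor in the next, sub-minor in the low 16 bits) is an assumption that happens to match the log above, where `0x10230fbe` lines up with `16.35.4030`.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [ 38.676575] mlx5_core 0000:81:00.1: print_health_info:397:(pid 0): synd 0x1: ...
LINE_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"print_health_info:\d+:\(pid \d+\): "
    r"(?P<key>raw fw_ver|[a-z_]+(?:\[\d+\])?) (?P<value>.+)"
)


def summarize_health(log_text):
    """Return {pci_address: {field: value}} for every health dump line found."""
    devices = defaultdict(dict)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            devices[m["pci"]][m["key"]] = m["value"].strip()
    return dict(devices)


def decode_raw_fw_ver(raw):
    """Split a 32-bit raw fw_ver word into major.minor.subminor (assumed layout)."""
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"


if __name__ == "__main__":
    print(decode_raw_fw_ver(0x10230FBE))  # 16.35.4030
```

Feeding the saved `dmesg` output to `summarize_health` makes it easy to confirm that all four functions (81:00.0, 81:00.1, 82:00.0, 82:00.1) report the same `synd` and `ext_synd` values.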

Thanks for the reply, xiaofengl.

I’ve rebooted the server and the issue is still present. Is it just a warning, or does it affect functionality? Going by your answer, I can re-burn the firmware :)

Thanks

It is not a warning, it is an error log. You had better re-burn the firmware.

Hi,

sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4119_pciconf1         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:81:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4121_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:82:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4121_pciconf1         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:83:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

I am not seeing any cr0 interfaces. Should I burn through the pciconf interfaces?
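
To line up the MST devices with the PCI addresses in the error log, a small Python sketch like the one below could help. The helper names `mst_devices_by_pci` and `card_address` are hypothetical. It assumes the `mst status` layout shown above (a `/dev/mst/...` line followed by a `domain:bus:dev.fn=` line), and it also assumes that the `.1` function in the log is the second port of the card listed under `.0`.

```python
import re

DEV_RE = re.compile(r"^(/dev/mst/\S+)")
BDF_RE = re.compile(
    r"domain:bus:dev\.fn=([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def mst_devices_by_pci(mst_output):
    """Map each PCI address in `mst status` output to its /dev/mst device."""
    mapping = {}
    current = None
    for line in mst_output.splitlines():
        dev = DEV_RE.match(line.strip())
        if dev:
            current = dev.group(1)  # remember the device until its address line
            continue
        bdf = BDF_RE.search(line)
        if bdf and current:
            mapping[bdf.group(1)] = current
            current = None
    return mapping


def card_address(pci_addr):
    """Assumption: every function of a card is reached through function 0."""
    return pci_addr.rsplit(".", 1)[0] + ".0"
```

With the output above, `0000:81:00.0` maps to `/dev/mst/mt4119_pciconf1` and `0000:82:00.0` maps to `/dev/mst/mt4121_pciconf0`, which are the two cards that show up in the error log.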

FYI,
