Error code -16 at startup on MCX516-GCAT once in a while.

lsorensen · December 4, 2020, 2:54pm

Once in a while we see an error code of -16 when booting the system. It has happened on multiple systems (initially we thought it might be just one flaky card). Rebooting almost always fixes it, but having to monitor for it and reboot isn’t great. We are currently running the 4.9.185 kernel, in case this is a driver problem. We have seen this on firmware 16.27.2008 and on some earlier versions too. I upgraded to 16.28.2006 yesterday and haven’t seen it there yet, although I am not sure how many attempts I need to make to be sure. Also, with 16.28.2006 (and 16.28.1002) we get strange errors at boot like this:

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: firmware version: 16.28.1002

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.1: firmware version: 16.28.1002

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 temp: renamed from eth3

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: renamed from eth4

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 eth4: renamed from temp

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: device's health compromised - reached miss count

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[0] 0xfffffffc

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[1] 0x00000001

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[2] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[3] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[4] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_exit_ptr 0x00991a18

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_callra 0x009919c4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: fw_ver 1.28.1002

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: hw_id 0x0000020d

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: irisc_index 5

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: ext_synd 0x8bb4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: device's health compromised - reached miss count

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[0] 0xfffffffc

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[1] 0x00000001

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[2] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[3] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[4] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_exit_ptr 0x00991a18

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_callra 0x009919c4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: fw_ver 1.28.1002

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: hw_id 0x0000020d

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: irisc_index 5

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: synd 0x1: firmware internal error

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: ext_synd 0x8bb4

Dec 4 09:19:09 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: Link up

Any suggestion as to why the two latest firmware versions do that? Every system on which I have tried either of the 16.28.x firmwares does that (all using the 4.9.185 kernel, of course). I was trying the updated firmware because I thought perhaps 2100377 could explain the -16 error we sometimes see on boot when the driver fails to start the card. So far I haven’t seen it with 16.28.x, but I am seeing those other strange errors, and neither our QA nor our customers will like seeing those in the logs. Someone will ask questions even if it is working.
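Since this currently has to be monitored by hand, here is a minimal Python sketch of how the log above could be scanned automatically. It is an illustration only, not an NVIDIA tool: the function names `scan_health_dumps` and `describe_errno` are hypothetical, the regular expressions are written against the exact line format pasted above, and it assumes the -16 is a negated Linux errno, as kernel drivers conventionally return.

```python
import errno
import re
from collections import defaultdict

# Matches driver lines keyed by PCI address, for example:
#   "... kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error"
# Lines such as "mlx5_core 0000:af:00.0 temp: renamed from eth3" do not match,
# because the PCI address is not followed directly by ": ".
LINE_RE = re.compile(
    r"mlx5_core (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?P<msg>.*)$"
)

# Fields of the health dump that are most useful in a support ticket.
FIELD_RE = re.compile(r"^(?P<key>synd|ext_synd|fw_ver|irisc_index) (?P<val>.+)$")


def scan_health_dumps(lines):
    """Return {pci_address: {field: value}} for every device that logged
    a "health compromised" message in the given log lines."""
    fields = defaultdict(dict)
    compromised = set()
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if not m:
            continue
        dev, msg = m.group("dev"), m.group("msg").strip()
        if "health compromised" in msg:
            compromised.add(dev)
            continue
        f = FIELD_RE.match(msg)
        # Only collect fields that follow a health message for that device.
        if f and dev in compromised:
            fields[dev][f.group("key")] = f.group("val")
    return {dev: dict(fields[dev]) for dev in sorted(compromised)}


def describe_errno(code):
    """Map a negative kernel-style return code to its errno name."""
    return errno.errorcode.get(abs(code), "unknown")
```

Feeding it the lines of the kernel log (for example an open file object for wherever your distribution writes kernel messages) returns one entry per affected PCI function, here `0000:af:00.0` and `0000:af:00.1`, each with its `synd`, `ext_synd`, `fw_ver` and `irisc_index`. On Linux, `describe_errno(-16)` gives `EBUSY`.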

MvB · December 5, 2020, 10:20pm

Hello Lennart,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, we want to continue debugging this issue through an NVIDIA Networking Technical Support ticket. You currently have a valid support contract, so sending an email to support@mellanox.com will open a support ticket, which will be handled by one of our support engineers. Please also note in the email which time zone you are in so we can route the ticket to your local support center.

Thank you and regards,

~NVIDIA Networking Technical Support

lsorensen · December 7, 2020, 2:48pm

OK, I have sent off an email with as much detail as I could think of.

jeff8918389183 · December 1, 2021, 7:41am

Hi Martijn, sorry to bother you. Would you be able to share the root cause and solution for this symptom? I am seeing the same symptom on my side. Much appreciated.
