Intermittent error code -16 at startup on MCX516-GCAT

Once in a while we see error code -16 when booting the system. It has happened on multiple systems (initially we thought it might be just one flaky card). Rebooting almost always fixes it, but having to monitor for it and reboot isn’t great. We are currently running kernel 4.9.185, in case this is a driver problem. We have seen this on firmware 16.27.2008 and on some earlier versions too. I upgraded to 16.28.2006 yesterday and haven’t seen it there yet, although I am not sure how many attempts I need to make to be sure. Also, with 16.28.2006 (and 16.28.1002) we get strange errors at boot like this:

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: firmware version: 16.28.1002

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps

Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.1: firmware version: 16.28.1002

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 temp: renamed from eth3

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: renamed from eth4

Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 eth4: renamed from temp

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: device's health compromised - reached miss count

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[0] 0xfffffffc

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[1] 0x00000001

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[2] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[3] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[4] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_exit_ptr 0x00991a18

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_callra 0x009919c4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: fw_ver 1.28.1002

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: hw_id 0x0000020d

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: irisc_index 5

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: ext_synd 0x8bb4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: device's health compromised - reached miss count

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[0] 0xfffffffc

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[1] 0x00000001

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[2] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[3] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[4] 0x00000000

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_exit_ptr 0x00991a18

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_callra 0x009919c4

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: fw_ver 1.28.1002

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: hw_id 0x0000020d

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: irisc_index 5

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: synd 0x1: firmware internal error

Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: ext_synd 0x8bb4

Dec 4 09:19:09 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: Link up

Any suggestion as to why the two latest firmwares log those errors? Every system on which I have tried either of the 16.28.x firmwares does that (all using the 4.9.185 kernel, of course). I tried the updated firmware because I thought 2100377 might explain the -16 error we sometimes see on boot when the driver fails to start the card. So far I haven’t seen it with 16.28.x, but I am seeing those other strange errors, and neither our QA nor our customers will like seeing those in the logs. Someone will ask questions even if it is working.
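As a side note on the -16 itself: assuming it is a negated Linux errno returned by the driver (the post does not show the exact message it appeared in), it can be decoded with Python's standard `errno` module. The helper name `describe_kernel_error` is made up for this sketch.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code (e.g. -16) to its
    errno name and the platform's message for it."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    name, message = describe_kernel_error(-16)
    print(name, "-", message)
```

On Linux, errno 16 is `EBUSY`. That says only that something the driver needed was reported busy; it does not identify the root cause.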

Hello Lennart,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, we would like to continue debugging this issue through an NVIDIA Networking Technical Support ticket. You currently have a valid support contract, so sending an email to support@mellanox.com will open a support ticket, which will be handled by one of our support engineers. Please also note in the email which time zone you are in so we can route the ticket to your local support center.

Thank you and regards,

~NVIDIA Networking Technical Support

OK, I have sent off an email with as much detail as I could think of.

Hi Martijn, sorry about that, but would you be able to share the root cause and solution for this symptom? I am seeing the same symptom on my side. Much appreciated.