Every so often we see an error code of -16 when booting the system. It has happened on multiple systems (initially we thought it might be just one flaky card). Rebooting almost always fixes it, but having to monitor for it and reboot isn’t great. We are currently running a 4.9.185 kernel, in case this is a driver problem. We have seen this on firmware 16.27.2008 and on some earlier versions too. I upgraded to 16.28.2006 yesterday and haven’t seen it there yet, although I am not sure how many attempts I need to make to be sure. Also, with 16.28.2006 (and 16.28.1002) we get strange errors at boot like this:
Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: firmware version: 16.28.1002
Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps
Dec 4 09:18:43 c1-xca2 kernel: mlx5_core 0000:af:00.1: firmware version: 16.28.1002
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 48828Mbps
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 temp: renamed from eth3
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: renamed from eth4
Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 eth4: renamed from temp
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: device's health compromised - reached miss count
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[0] 0xfffffffc
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[1] 0x00000001
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[2] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[3] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_var[4] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_exit_ptr 0x00991a18
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: assert_callra 0x009919c4
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: fw_ver 1.28.1002
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: hw_id 0x0000020d
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: irisc_index 5
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: ext_synd 0x8bb4
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: device's health compromised - reached miss count
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[0] 0xfffffffc
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[1] 0x00000001
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[2] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[3] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_var[4] 0x00000000
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_exit_ptr 0x00991a18
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: assert_callra 0x009919c4
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: fw_ver 1.28.1002
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: hw_id 0x0000020d
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: irisc_index 5
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: synd 0x1: firmware internal error
Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.1: ext_synd 0x8bb4
Dec 4 09:19:09 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: Link up
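Since we have to watch for this across many systems, here is a rough sketch of how the log above could be scanned automatically. It is only an illustration, not something we run in production: the function name `find_health_events` and the regular expression are my own, and it assumes syslog lines in exactly the format pasted above.

```python
import re

# Matches syslog lines of the form seen above, e.g.
# "Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error"
# The colon after the PCI address is optional because the rename and
# link messages ("0000:af:00.0 temp: renamed from eth3") omit it.
LINE_RE = re.compile(
    r"kernel: mlx5_core "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):? "
    r"(?P<msg>.*)$"
)


def find_health_events(lines):
    """Return {pci_address: {"synd": ..., "ext_synd": ...}} for every
    device that logged a "health compromised" message."""
    events = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        pci, msg = match.group("pci"), match.group("msg")
        if "health compromised" in msg:
            # Start collecting the dump that follows for this device.
            events.setdefault(pci, {})
        elif pci in events:
            key, _, value = msg.partition(" ")
            if key in ("synd", "ext_synd"):
                events[pci][key] = value
    return events


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "Dec 4 09:18:44 c1-xca2 kernel: mlx5_core 0000:af:00.0 temp: renamed from eth3",
    "Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: device's health compromised - reached miss count",
    "Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: fw_ver 1.28.1002",
    "Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: synd 0x1: firmware internal error",
    "Dec 4 09:18:53 c1-xca2 kernel: mlx5_core 0000:af:00.0: ext_synd 0x8bb4",
    "Dec 4 09:19:09 c1-xca2 kernel: mlx5_core 0000:af:00.1 eth3: Link up",
]

if __name__ == "__main__":
    for pci, info in find_health_events(SAMPLE).items():
        print(pci, info)
```

Feeding it the full boot log instead of `SAMPLE` should report both 0000:af:00.0 and 0000:af:00.1, since both ports print the same dump.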
Any suggestions as to why the two latest firmware versions produce these health errors? Every system on which I have tried either of the 16.28.x firmware versions does this (all running the 4.9.185 kernel, of course). I tried the updated firmware because I thought 2100377 might explain the -16 error we sometimes see at boot when the driver fails to start the card. So far I haven’t seen that error with 16.28.x, but I am seeing these other strange errors, and neither our QA nor our customers will like seeing them in the logs. Someone will ask questions even if everything is working.
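For reference, on the -16 itself: assuming it is a negated Linux errno (the usual convention for return codes from kernel drivers), it can be decoded with a few lines of Python. The helper name `describe_kernel_error` is just made up for this sketch.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a kernel-style negative return code (e.g. -16) to
    (symbolic errno name, human-readable message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to names such as "EBUSY".
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    name, message = describe_kernel_error(-16)
    print(name, "-", message)
```

Under that assumption -16 is EBUSY, which would fit the driver failing to start the card because the device was not yet ready, though that reading is a guess on my part.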