Hello.
We are having problems with MCX312B adapters in an ASUS ESC8000 G3 server platform.
Information about server:
Driver: 4.2-1.0.1
OS: ubuntu 14.04 4.4.0-116-generic
2 x MCX312B
8 x Nvidia 1080G GPU
We see the following error for both network cards: "AER: Uncorrected (Non-Fatal) error received: id=0010". After that, the network cards are reset.
The error occurs at random.
Log:
May 28 06:15:04 [4196877.274044] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
May 28 06:15:04 [4196877.274829] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
May 28 06:15:04 [4196877.276009] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
May 28 06:15:04 [4196877.276607] pcieport 0000:00:02.0: [14] Completion Timeout (First)
May 28 06:15:04 [4196877.277172] pcieport 0000:00:02.0: broadcast error_detected message
May 28 06:15:04 [4196877.277719] mlx4_core 0000:04:00.0: mlx4_pci_err_detected was called
May 28 06:15:04 [4196877.278251] mlx4_core 0000:04:00.0: device is going to be reset
May 28 06:15:04 [4196877.278763] mlx4_core 0000:04:00.0: crdump: Dump was already collected, skipping
May 28 06:15:05 [4196878.280748] mlx4_core 0000:04:00.0: device was reset successfully
May 28 06:15:05 [4196878.281297] mlx4_en 0000:04:00.0: Internal error detected, restarting device
May 28 06:15:05 [4196878.281301] mlx4_core 0000:04:00.0: Could not post command 0x49: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0
May 28 06:15:05 [4196878.281310] mlx4_core 0000:04:00.0: Could not post command 0x43: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0
May 28 06:15:05 [4196878.282838] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started
May 28 06:15:05 [4196878.283377] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended
May 28 06:15:05 [4196878.284084] mlx4_en: eth2: Close port called
May 28 06:15:05 [4196878.300391] mlx4_core 0000:04:00.0: Fail to set mac in port 1 during unregister
May 28 06:15:06 [4196878.342788] bond2: Releasing active interface eth2
May 28 06:15:06 [4196878.347538] bond2: the permanent HWaddr of eth2 - ec:0d:9a:17:64:00 - is still in use by bond2 - set the HWaddr of eth2 to a different address to avoid conflicts
May 28 06:15:06 [4196878.348620] bond2: first active interface up!
May 28 06:15:06 [4196878.368454] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister
May 28 06:15:06 [4196878.369261] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister
May 28 06:15:06 [4196878.369824] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister
May 28 06:15:06 [4196878.370381] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister
May 28 06:15:06 [4196878.370937] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister
May 28 06:15:06 [4196878.371501] device eth2 left promiscuous mode
May 28 06:15:06 [4196878.372066] mlx4_en: eth2: Failed to pass user MAC(ec:0d:9a:17:64:00) to Firmware for port 1, with error -5
May 28 06:15:06 [4196878.456492] mlx4_en 0000:04:00.0: removed PHC
May 28 06:15:06 [4196878.457538] mlx4_en: eth3: Close port called
May 28 06:15:06 [4196878.472373] mlx4_core 0000:04:00.0: Fail to set mac in port 2 during unregister
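
For reference, the "error status/mask=00004000/00000000" and "id=0010" fields in the log above can be decoded by hand. The following is a minimal sketch in Python, assuming the standard PCIe AER Uncorrectable Error Status bit layout and the usual bus/device/function packing of a Requester ID. The function names are made up for illustration and are not part of any driver or tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (assumed from the PCI Express Base Specification).
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncorrectable_status(status, mask=0):
    """Return the names of the error bits set in status and not masked."""
    active = status & ~mask
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if active & (1 << bit)
    ]


def decode_requester_id(rid):
    """Split a 16-bit Requester ID into bus:device.function notation."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


if __name__ == "__main__":
    # Values copied from the log: status/mask=00004000/00000000, id=0010
    print(decode_uncorrectable_status(0x00004000, 0x00000000))
    print(decode_requester_id(0x0010))
```

With the values from the log, the status word has only bit 14 set (Completion Timeout, matching the "[14]" line), and the Requester ID 0010 decodes to 00:02.0, which is the root port that reported the error rather than the adapter at 04:00.0.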