PCIe Bus Errors with ConnectX-3 Pro and ESC8000 G3

Hello.

We are having problems with MCX312B adapters in the ASUS ESC8000 G3 server platform.

Information about server:

Driver: 4.2-1.0.1

OS: Ubuntu 14.04, kernel 4.4.0-116-generic

2 x MCX312B

8 x Nvidia 1080G GPU

We saw the error "AER: Uncorrected (Non-Fatal) error received: id=0010" on both network cards. After that, the network cards were reset.

This error occurs randomly.

Log:

May 28 06:15:04 [4196877.274044] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

May 28 06:15:04 [4196877.274829] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

May 28 06:15:04 [4196877.276009] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000

May 28 06:15:04 [4196877.276607] pcieport 0000:00:02.0: [14] Completion Timeout (First)

May 28 06:15:04 [4196877.277172] pcieport 0000:00:02.0: broadcast error_detected message

May 28 06:15:04 [4196877.277719] mlx4_core 0000:04:00.0: mlx4_pci_err_detected was called

May 28 06:15:04 [4196877.278251] mlx4_core 0000:04:00.0: device is going to be reset

May 28 06:15:04 [4196877.278763] mlx4_core 0000:04:00.0: crdump: Dump was already collected, skipping

May 28 06:15:05 [4196878.280748] mlx4_core 0000:04:00.0: device was reset successfully

May 28 06:15:05 [4196878.281297] mlx4_en 0000:04:00.0: Internal error detected, restarting device

May 28 06:15:05 [4196878.281301] mlx4_core 0000:04:00.0: Could not post command 0x49: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

May 28 06:15:05 [4196878.281310] mlx4_core 0000:04:00.0: Could not post command 0x43: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

May 28 06:15:05 [4196878.282838] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started

May 28 06:15:05 [4196878.283377] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended

May 28 06:15:05 [4196878.284084] mlx4_en: eth2: Close port called

May 28 06:15:05 [4196878.300391] mlx4_core 0000:04:00.0: Fail to set mac in port 1 during unregister

May 28 06:15:06 [4196878.342788] bond2: Releasing active interface eth2

May 28 06:15:06 [4196878.347538] bond2: the permanent HWaddr of eth2 - ec:0d:9a:17:64:00 - is still in use by bond2 - set the HWaddr of eth2 to a different address to avoid conflicts

May 28 06:15:06 [4196878.348620] bond2: first active interface up!

May 28 06:15:06 [4196878.368454] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:06 [4196878.369261] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:06 [4196878.369824] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:06 [4196878.370381] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:06 [4196878.370937] mlx4_core 0000:04:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:06 [4196878.371501] device eth2 left promiscuous mode

May 28 06:15:06 [4196878.372066] mlx4_en: eth2: Failed to pass user MAC(ec:0d:9a:17:64:00) to Firmware for port 1, with error -5

May 28 06:15:06 [4196878.456492] mlx4_en 0000:04:00.0: removed PHC

May 28 06:15:06 [4196878.457538] mlx4_en: eth3: Close port called

May 28 06:15:06 [4196878.472373] mlx4_core 0000:04:00.0: Fail to set mac in port 2 during unregister


May 28 06:15:06 [4196878.509546] bond1: Releasing active interface eth3

May 28 06:15:06 [4196878.514276] bond1: the permanent HWaddr of eth3 - ec:0d:9a:17:64:01 - is still in use by bond1 - set the HWaddr of eth3 to a different address to avoid conflicts

May 28 06:15:06 [4196878.515467] bond1: first active interface up!

May 28 06:15:06 [4196878.532430] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:06 [4196878.533167] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:06 [4196878.533770] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:06 [4196878.534345] mlx4_core 0000:04:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:06 [4196878.534907] mlx4_en: eth3: Failed to pass user MAC(ec:0d:9a:17:64:01) to Firmware for port 2, with error -5

May 28 06:15:07 [4196879.660429] mlx4_core 0000:05:00.0: mlx4_pci_err_detected was called

May 28 06:15:07 [4196879.661211] mlx4_core 0000:05:00.0: device is going to be reset

May 28 06:15:07 [4196879.661837] mlx4_core 0000:05:00.0: crdump: Dump was already collected, skipping

May 28 06:15:08 [4196880.665907] mlx4_core 0000:05:00.0: device was reset successfully

May 28 06:15:08 [4196880.666581] mlx4_en 0000:05:00.0: Internal error detected, restarting device

May 28 06:15:08 [4196880.666584] mlx4_core 0000:05:00.0: Could not post command 0x49: ret=-5, in_param=0x0, in_mod=0x2, op_mod=0x0

May 28 06:15:08 [4196880.667885] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started

May 28 06:15:08 [4196880.668564] <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended

May 28 06:15:08 [4196880.669483] mlx4_en: eth4: Close port called

May 28 06:15:08 [4196880.684153] mlx4_core 0000:05:00.0: Fail to set mac in port 1 during unregister

May 28 06:15:08 [4196880.724077] bond2: Removing an active aggregator

May 28 06:15:08 [4196880.728869] bond2: Releasing active interface eth4

May 28 06:15:08 [4196880.748641] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:08 [4196880.749286] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:08 [4196880.749912] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:08 [4196880.750529] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:08 [4196880.751139] mlx4_core 0000:05:00.0: Fail to set vlan in port 1 during unregister

May 28 06:15:08 [4196880.751733] device eth4 left promiscuous mode

May 28 06:15:08 [4196880.752338] mlx4_en: eth4: Failed to pass user MAC(ec:0d:9a:17:63:e0) to Firmware for port 1, with error -5

May 28 06:15:08 [4196881.108237] mlx4_en 0000:05:00.0: removed PHC

May 28 06:15:08 [4196881.109386] mlx4_en: eth5: Close port called

May 28 06:15:08 [4196881.124131] mlx4_core 0000:05:00.0: Fail to set mac in port 2 during unregister

May 28 06:15:08 [4196881.167114] bond1: Removing an active aggregator

May 28 06:15:08 [4196881.171891] bond1: Releasing active interface eth5

May 28 06:15:08 [4196881.184362] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:08 [4196881.184982] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:08 [4196881.185573] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:08 [4196881.186138] mlx4_core 0000:05:00.0: Fail to set vlan in port 2 during unregister

May 28 06:15:08 [4196881.186696] mlx4_en: eth5: Failed to pass user MAC(ec:0d:9a:17:63:e1) to Firmware for port 2, with error -5

May 28 06:15:10 [4196882.368172] pcieport 0000:00:02.0: broadcast slot_reset message

May 28 06:15:10 [4196882.369063] mlx4_core 0000:04:00.0: mlx4_pci_slot_reset was called

May 28 06:15:10 [4196882.371798] mlx4_core 0000:05:00.0: mlx4_pci_slot_reset was called

May 28 06:15:10 [4196882.377890] pcieport 0000:00:02.0: broadcast resume message

May 28 06:15:10 [4196882.378505] mlx4_core 0000:04:00.0: mlx4_pci_resume was called

May 28 06:15:15 [4196887.953110] mlx4_core: device is working in RoCE mode: Roce V1

May 28 06:15:15 [4196887.953742] mlx4_core: UD QP Gid type is: V1

May 28 06:15:17 [4196889.766596] mlx4_core 0000:04:00.0: DMFS high rate steer mode is: performance optimized for limited rule configuration (static)

May 28 06:15:17 [4196889.768097] mlx4_core 0000:04:00.0: PCIe BW is different than device’s capability

May 28 06:15:27 [4196900.129419] mlx4_core 0000:05:00.0: PCIe BW is different than device’s capability

May 28 06:15:27 [4196900.129961] mlx4_core 0000:05:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s

May 28 06:15:27 [4196900.130522] mlx4_core 0000:05:00.0: PCIe link width is x8, device supports x8

May 28 06:15:28 [4196900.891587] mlx4_en 0000:05:00.0: Activating port:1

May 28 06:15:28 [4196900.911848] mlx4_en: 0000:05:00.0: Port 1: Using 32 TX rings

May 28 06:15:28 [4196900.912628] mlx4_en: 0000:05:00.0: Port 1: Using 16 RX rings

May 28 06:15:28 [4196900.913632] mlx4_en: 0000:05:00.0: Port 1: Initializing port

May 28 06:15:28 [4196900.916585] mlx4_en 0000:05:00.0: registered PHC clock

May 28 06:15:28 [4196900.917588] mlx4_en 0000:05:00.0: Activating port:2

May 28 06:15:28 [4196900.921897] mlx4_en: 0000:05:00.0: Port 2: Using 32 TX rings

May 28 06:15:28 [4196900.922442] mlx4_en: 0000:05:00.0: Port 2: Using 16 RX rings

May 28 06:15:28 [4196900.924792] mlx4_en: 0000:05:00.0: Port 2: Initializing port

May 28 06:15:28 [4196900.943370] <mlx4_ib> mlx4_ib_add: counter index 2 for port 1 allocated 1

May 28 06:15:28 [4196900.943892] <mlx4_ib> mlx4_ib_add: counter index 3 for port 2 allocated 1

May 28 06:15:28 [4196900.963892] pcieport 0000:00:02.0: AER: Device recovery successful

May 28 06:15:28 [4196900.982030] mlx4_en: eth4: Link Up

May 28 06:15:28 [4196900.982576] mlx4_en: eth5: Link Up


Is it correct behavior for the network card to be reset after this error?

Did anybody experience a similar issue?

Please share any suggestions about how to fix this.
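For anyone reading the log above, the two hex values in the AER lines can be decoded by hand. The following is a minimal Python sketch, assuming the standard PCIe AER Uncorrectable Error Status bit layout and the usual bus/device/function packing of a Requester ID. The function names are illustrative and are not part of any driver or tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (assumed from the PCIe specification; not taken from the log itself).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_status(status):
    """Return the names of all error bits set in the status word."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def decode_requester_id(rid):
    """Split a 16-bit Requester ID into bus:device.function notation."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return "%02x:%02x.%x" % (bus, dev, fn)


if __name__ == "__main__":
    # Values copied from the log: status/mask=00004000/00000000, id=0010
    print(decode_status(0x00004000))
    print(decode_requester_id(0x0010))
```

Under these assumptions, status 00004000 has only bit 14 set, which agrees with the "[14] Completion Timeout" line, and id=0010 maps to 00:02.0, the root port that reported the error rather than either network card.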

Hi Aleksey,

Thank you for posting your question on the Mellanox Community.

Can you please install the latest MLNX_OFED version, 4.3, including the latest firmware for the ConnectX-3 Pro EN? (Version 4.3 requires Ubuntu 16.04.)

If it is not possible to upgrade the OS to install the latest driver, please check whether the ConnectX-3 cards have the latest available firmware installed. You can check this through the following link: Firmware for ConnectX®-3 Pro EN

The firmware needs to be upgraded to 2.42.5000.

Please update with your new findings.

Thanks and regards,

~Mellanox Technical Support
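The firmware check suggested above can be scripted across several hosts. This is a rough Python sketch, assuming the installed version string has already been read separately (for example from the firmware-version field printed by `ethtool -i <interface>`). The helper names and the sample installed version are hypothetical; only 2.42.5000 comes from this thread.

```python
# Target firmware version named in the reply above.
REQUIRED = "2.42.5000"


def parse_version(text):
    """Turn a dotted version such as '2.42.5000' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, required=REQUIRED):
    """Return True when the installed firmware is older than required."""
    return parse_version(installed) < parse_version(required)


if __name__ == "__main__":
    # "2.40.7000" is a made-up example value, not taken from the thread.
    for version in ("2.40.7000", "2.42.5000"):
        print(version, "needs upgrade:", needs_upgrade(version))
```

Comparing tuples of integers instead of raw strings avoids mistakes such as treating "2.9.1000" as newer than "2.42.5000".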

Hi Aleksey,

We noticed that you also opened a case with support@mellanox.com.

We will continue debugging through the case and update you accordingly.

Thanks and regards,

~Martijn