PCI-E Bus Errors with ConnectX-3 and Asus X99-E WS

Hi,

I am experiencing several problems when using a ConnectX-3 40GbE adapter (MCX313A-BCBT) in an Asus X99-E WS motherboard.

First, it makes system startup quite unstable. In roughly 2 out of 10 attempts, the system halts before POST and shows error code 94 on the mainboard's 7-segment display (meaning PCI Enumeration Error).

When it does boot, the card emits PCI bus errors during initialization. The setup is a fresh install of Fedora 21 x86_64 (a supported OS) with the latest Linux driver (mlnx-en-3.0-1.0.1.tgz) and the latest firmware, and a single NVidia GPU installed besides the HCA. Sometimes the card is disabled completely; sometimes it starts working after a 1-1.5 minute wait during boot. When such errors occur, they look like this:

[ 10.743067] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

[ 10.743077] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

[ 10.743142] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000

[ 10.743187] pcieport 0000:00:02.0: [14] Completion Timeout (First)

[ 10.743225] pcieport 0000:00:02.0: broadcast error_detected message

[ 16.852525] mlx4_core 0000:0a:00.0: command 0xff6 timed out (go bit not cleared)

[ 16.852527] mlx4_core 0000:0a:00.0: RUN_FW command failed, aborting

[ 16.855670] mlx4_core 0000:0a:00.0: mlx4_cmd_post:cmd_pending failed

[ 16.855702] mlx4_core 0000:0a:00.0: Failed to start FW, aborting

[ 17.858368] mlx4_core: probe of 0000:0a:00.0 failed with error -110

[ 17.858638] pcieport 0000:00:02.0: AER: Device recovery failed

[ 17.858643] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

[ 17.858652] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

[ 17.858735] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000

[ 17.858787] pcieport 0000:00:02.0: [14] Completion Timeout (First)

[ 17.858832] pcieport 0000:00:02.0: broadcast error_detected message

[ 17.858836] pcieport 0000:00:02.0: AER: Device recovery failed

[ 61.820905] mlx4_core: device is working in RoCE mode: Roce V1

[ 61.820907] mlx4_core: gid_type 1 for UD QPs is not supported by the devicegid_type 0 was chosen instead

[ 61.820908] mlx4_core: UD QP Gid type is: V1

[ 101.351233] mlx4_core 0000:0a:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

[ 101.351235] mlx4_core 0000:0a:00.0: PCIe link width is x8, device supports x8

[ 101.354441] mlx4_core 0000:0a:00.0: irq 62 for MSI/MSI-X

[ 101.354445] mlx4_core 0000:0a:00.0: irq 63 for MSI/MSI-X

[ 101.354448] mlx4_core 0000:0a:00.0: irq 64 for MSI/MSI-X

[ 101.354451] mlx4_core 0000:0a:00.0: irq 65 for MSI/MSI-X

[ 101.354453] mlx4_core 0000:0a:00.0: irq 66 for MSI/MSI-X

[ 101.354456] mlx4_core 0000:0a:00.0: irq 67 for MSI/MSI-X

[ 101.354459] mlx4_core 0000:0a:00.0: irq 68 for MSI/MSI-X

[ 101.354462] mlx4_core 0000:0a:00.0: irq 69 for MSI/MSI-X

[ 101.354464] mlx4_core 0000:0a:00.0: irq 70 for MSI/MSI-X

[ 101.354466] mlx4_core 0000:0a:00.0: irq 71 for MSI/MSI-X

[ 101.354469] mlx4_core 0000:0a:00.0: irq 72 for MSI/MSI-X

[ 101.354471] mlx4_core 0000:0a:00.0: irq 73 for MSI/MSI-X

[ 101.354474] mlx4_core 0000:0a:00.0: irq 74 for MSI/MSI-X

[ 102.097189] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called

[ 102.097198] mlx4_core 0000:0a:00.0: device is going to be reset

[ 102.125455] mlx4_en: Mellanox ConnectX HCA Ethernet driver v3.0-1.0.1 (Feb 2014)

[ 103.138702] mlx4_core 0000:0a:00.0: device was reset successfully

[ 103.138717] mlx4_core 0000:0a:00.0: Could not post command 0xd: ret=-5, in_param=0x65ae56000, in_mod=0x100, op_mod=0x0

[ 103.138721] mlx4_core 0000:0a:00.0: SW2HW_MPT failed (-5)

[ 103.138724] mlx4_en 0000:0a:00.0: Failed enabling memory region

[ 104.151519] pcieport 0000:00:02.0: AER: Device recovery failed

[ 104.151526] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

[ 104.151536] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

[ 104.151540] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000

[ 104.151543] pcieport 0000:00:02.0: [14] Completion Timeout (First)

[ 104.151548] pcieport 0000:00:02.0: broadcast error_detected message

[ 104.151553] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called

[ 104.151556] ------------[ cut here ]------------

[ 104.151565] WARNING: CPU: 0 PID: 165 at drivers/pci/pci.c:1535 pci_disable_device+0x99/0xb0()

[ 104.151567] mlx4_core 0000:0a:00.0: disabling already-disabled device

[ 104.151569] Modules linked in:

[ 104.151571] mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) vxlan udp_tunnel nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT xt_conntrack cfg80211 ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_hdmi vfat x86_pkg_temp_thermal fat coretemp kvm crct10dif_pclmul crc32_pclmul snd_hda_intel crc32c_intel eeepc_wmi asus_wmi snd_hda_controller sparse_keymap rfkill snd_hda_codec iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_hwdep snd_seq snd_seq_device snd_pcm sb_edac snd_timer serio_raw edac_core snd soundcore

[ 104.151619] mlx4_core(OE) mlx_compat(OE) mei_me i2c_i801 lpc_ich mei mfd_core shpchp tpm_infineon tpm_tis tpm nouveau video mxm_wmi igb drm_kms_helper ttm e1000e drm dca ata_generic ptp i2c_algo_bit pata_acpi pps_core wmi [last unloaded: mlx4_core]

[ 104.151642] CPU: 0 PID: 165 Comm: kworker/0:2 Tainted: G OE 3.17.4-301.fc21.x86_64 #1

[ 104.151644] Hardware name: ASUS All Series/X99-E WS, BIOS 1102 04/28/2015

[ 104.151650] Workqueue: events aer_isr

[ 104.151653] 0000000000000000 0000000017f53b38 ffff880659a8bbe8 ffffffff8173f929

[ 104.151657] ffff880659a8bc30 ffff880659a8bc20 ffffffff810970ad ffff88065ccbc000

[ 104.151661] ffff88065cc60510 0000000000000001 ffff880658ecfb10 ffff88065cc85800

[ 104.151665] Call Trace:

[ 104.151671] [] dump_stack+0x45/0x56

[ 104.151678] [] warn_slowpath_common+0x7d/0xa0

[ 104.151683] [] warn_slowpath_fmt+0x5c/0x80

[ 104.151696] [] ? mlx4_enter_error_state.part.7+0x188/0x350 [mlx4_core]

[ 104.151704] [] pci_disable_device+0x99/0xb0

[ 104.151720] [] mlx4_pci_err_detected+0x77/0xa0 [mlx4_core]

[ 104.151725] [] report_error_detected+0x50/0x100

[ 104.151730] [] ? find_source_device+0x80/0x80

[ 104.151734] [] pci_walk_bus+0x79/0xa0

[ 104.151738] [] ? find_source_device+0x80/0x80

[ 104.151742] [] broadcast_error_message+0xdc/0x100

[ 104.151746] [] do_recovery+0x43/0x280

[ 104.151750] [] ? get_device_error_info+0xd9/0x1b0

[ 104.151754] [] aer_isr+0x36a/0x450

[ 104.151761] [] process_one_work+0x14d/0x400

[ 104.151765] [] worker_thread+0x6b/0x4a0

[ 104.151770] [] ? rescuer_thread+0x2a0/0x2a0

[ 104.151773] [] kthread+0xea/0x100

[ 104.151777] [] ? kthread_create_on_node+0x1a0/0x1a0

[ 104.151783] [] ret_from_fork+0x7c/0xb0

[ 104.151787] [] ? kthread_create_on_node+0x1a0/0x1a0

[ 104.151789] ---[ end trace 858d8c660747219b ]---

[ 104.151793] pcieport 0000:00:02.0: AER: Device recovery failed

[ 104.151796] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

[ 104.151803] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

[ 104.151807] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000

[ 104.151810] pcieport 0000:00:02.0: [14] Completion Timeout (First)
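For anyone reading the log above: the `error status/mask=00004000/00000000` value is a bitmask, and the `[14] Completion Timeout` line is simply bit 14 of it. A minimal Python sketch of that decoding follows. The bit names are the uncorrectable error bits as I understand them from the PCIe AER capability; the function name `decode_aer_status` is made up for illustration and is not part of any driver or tool.

```python
# Uncorrectable Error Status bit positions (PCIe AER capability).
# Bits not listed here are reported as "Unknown/reserved".
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_aer_status(status_hex, mask_hex="0"):
    """Return (bit, name) pairs for every unmasked bit set in the status word.

    Hypothetical helper: takes the two hex strings exactly as they appear
    in the 'error status/mask=' kernel log line.
    """
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    active = status & ~mask
    return [
        (bit, AER_UNCORRECTABLE_BITS.get(bit, "Unknown/reserved"))
        for bit in range(32)
        if active & (1 << bit)
    ]


# Values taken from the log above.
print(decode_aer_status("00004000", "00000000"))
# prints [(14, 'Completion Timeout')]
```

So the root port at 0000:00:02.0 reports only a Completion Timeout and nothing else, which fits the `mlx4_core` commands timing out right afterwards.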

When using the same Mellanox card in a different mainboard (for example, a Gigabyte GA-Z97X-UD3H), it boots and inits flawlessly, using the exact same OS.

We have a cluster built from these boards, and they all show the same issue at random, so it is not a fault of a single mainboard but looks like an incompatibility.

Did anybody experience a similar issue?

Please share any suggestions about how to stabilize this.

Thanks,

Peter

Update: it looks like the problems persist, even after another BIOS update (to version 1301) and setting PCI-E to Gen1.

Some cards in the cluster still fail to initialize, while others do not even appear in lspci. After a reboot, they may or may not appear. Several reboots are necessary to correctly start everything up.
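When cards come and go between reboots, checking each node by hand gets tedious. Below is a rough Python sketch that looks for Mellanox devices in `lspci -n` output by PCI vendor ID (15b3 is Mellanox). The function name and the sample text are hypothetical; in particular, the device ID in the sample is only illustrative and is not taken from the logs in this thread.

```python
import re

MELLANOX_VENDOR_ID = "15b3"

# Matches `lspci -n` lines of the form "<slot> <class>: <vendor>:<device> ..."
LSPCI_N_LINE = re.compile(
    r"^(\S+)\s+[0-9a-f]{4}:\s+([0-9a-f]{4}):([0-9a-f]{4})", re.IGNORECASE
)


def find_vendor_devices(lspci_n_output, vendor_id=MELLANOX_VENDOR_ID):
    """Return (slot, device_id) for every device of the given vendor."""
    found = []
    for line in lspci_n_output.splitlines():
        match = LSPCI_N_LINE.match(line.strip())
        if match and match.group(2).lower() == vendor_id:
            found.append((match.group(1), match.group(3).lower()))
    return found


# Illustrative sample only; on a real node one could feed in
# subprocess.run(["lspci", "-n"], capture_output=True, text=True).stdout
sample = """\
00:02.0 0604: 8086:2f04 (rev 02)
0a:00.0 0200: 15b3:1003
"""

devices = find_vendor_devices(sample)
print(devices if devices else "no Mellanox device enumerated")
```

Running something like this on every node after boot would at least show which machines need another reboot.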

Could you try the latest driver, 3.1-1.0.3, from http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers ?

Update: forcing the mainboard to use PCI-E 1.0 speed seems to solve the issue.

This can be done in the BIOS utility → Advanced → PCH Configuration → PCI Express Configuration → PCIe Speed: set from “Auto” to “Gen1”

This was suggested by one of the posters in the following thread, in relation to Nvidia GPUs suffering from a similar error on X99-chipset-based mainboards:

Question for X99 board owners with Nvidia cards: do you see PCIe bus errors, please respond to poll - Page 2 | Overclock.net

While downgrading PCIe is a workaround, it’s still an issue waiting for a proper solution.
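To check whether the BIOS setting actually reached the card, the `mlx4_core` line "PCIe link speed is ..., device supports ..." from dmesg (it appears in the log in the first post) can be compared before and after the change. A small Python sketch that extracts it; `parse_link_speeds` is a hypothetical helper, and the regex assumes the message format shown in this thread, which may differ in other driver versions.

```python
import re

# Format as printed by mlx4_core in the log quoted earlier in this thread.
LINK_SPEED_RE = re.compile(
    r"mlx4_core (\S+): PCIe link speed is ([\d.]+)GT/s, "
    r"device supports ([\d.]+)GT/s"
)


def parse_link_speeds(dmesg_text):
    """Map PCI address -> (negotiated GT/s, supported GT/s)."""
    speeds = {}
    for match in LINK_SPEED_RE.finditer(dmesg_text):
        speeds[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return speeds


# Line copied from the log in the first post.
sample = (
    "[ 101.351233] mlx4_core 0000:0a:00.0: PCIe link speed is 8.0GT/s, "
    "device supports 8.0GT/s"
)

for address, (negotiated, supported) in parse_link_speeds(sample).items():
    state = "full speed" if negotiated >= supported else "downgraded"
    print(f"{address}: {negotiated} GT/s of {supported} GT/s ({state})")
# prints 0000:0a:00.0: 8.0 GT/s of 8.0 GT/s (full speed)
```

With the Gen1 workaround active I would expect the first number to drop to 2.5, though I have not verified the exact output on this board.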

I’m late to the party but I think we can confirm that this still occurs with CentOS 6.7 and driver version 3.1-1.0.3; I don’t have direct access to the systems but I am working to reproduce in a testing environment. If I can do anything to help accelerate resolution please let me know.

Hi, I have the same problem with a Mellanox ConnectX-3 CX354A QBCT and an Asus X99 Deluxe II mainboard. Are there any updates here? Setting the PCIe speed to Gen1 slows down the whole system.
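To put a number on that slowdown, here is a back-of-the-envelope Python calculation of raw PCIe link bandwidth per generation for an x8 link (the width reported in the log in the first post). It uses the standard per-lane transfer rates and line encodings and ignores TLP and protocol overhead, so real throughput is lower. The function name is made up for illustration.

```python
# (transfer rate per lane in GT/s, line-encoding efficiency) per generation
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gbps(generation, lanes):
    """Raw one-direction link bandwidth in Gbit/s, before protocol overhead."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency * lanes


for gen in (1, 2, 3):
    print(f"Gen{gen} x8: {link_bandwidth_gbps(gen, 8):.1f} Gbit/s")
# prints
# Gen1 x8: 16.0 Gbit/s
# Gen2 x8: 32.0 Gbit/s
# Gen3 x8: 63.0 Gbit/s
```

A Gen1 x8 link tops out at about 16 Gbit/s, so a 40GbE port behind it cannot reach line rate, and any GPU forced to Gen1 by the same setting loses bandwidth as well.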