ConnectX-6 DX LACP Flap Issues

Hello,

Has anyone encountered similar issues with LACP flapping on the second port of a NIC, or can anyone suggest helpful troubleshooting steps? Here’s our problem statement:

Setup:
• LACP LAG configured with 2x 100Gbps ports on the same NIC card
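
For context, a minimal Linux bonding sketch roughly equivalent to this setup (interface names ens3f0/ens3f1 are placeholders, and the lacp_rate/xmit_hash_policy values are illustrative, not our exact settings):

# Create an 802.3ad (LACP) bond from the two ports of one NIC
ip link add bond0 type bond mode 802.3ad lacp_rate fast xmit_hash_policy layer3+4
ip link set ens3f0 down
ip link set ens3f1 down
ip link set ens3f0 master bond0
ip link set ens3f1 master bond0
ip link set bond0 up
ip link set ens3f0 up
ip link set ens3f1 up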

Issue Description:
• Port 2 randomly drops from the LAG, occurring up to hundreds of times per day
• The problem affects multiple servers, though many remain stable; it follows the NIC cards when they are moved to other servers
• NIC drivers are current (running latest LTS version)
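
To confirm what is actually loaded, the driver and NIC firmware versions can be read per interface (interface names are placeholders):

ethtool -i ens3f0    # reports driver, version and firmware-version
ethtool -i ens3f1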

Troubleshooting Performed:
• Server-side packet capture (tcpdump) shows no abnormalities in LACPDUs
• Switch-side packet capture reveals the NIC sending LACPDUs on the wrong physical port during incidents
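
For anyone wanting to reproduce the capture: LACPDUs use the Slow Protocols EtherType 0x8809, so they can be isolated per member port roughly like this (interface names are placeholders; -vv makes tcpdump decode the actor/partner fields):

tcpdump -eni ens3f0 -vv ether proto 0x8809
tcpdump -eni ens3f1 -vv ether proto 0x8809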

Any insights or suggestions would be greatly appreciated.

Another issue on the forum that appears to be similar:

hi

If the issue follows the NIC cards, it may be a hardware issue with the card.
You can RMA the affected card: https://enterprise-support.nvidia.com/s/request-rma

Thank you

Unfortunately, this issue is affecting hundreds of NICs.
I have opened a support case with NVIDIA; however, they require testing the NICs with legacy switches as a prerequisite before proceeding with further investigation.

Thank you

Is it possible to sniff traffic from the NIC card itself?

From the OS-level trace, the LACPDU is handed to the mlx5e_xmit function with the correct information to be transmitted from ens3f0, but per the packet captures on the far end it appears to exit via ens3f1, resulting in LACP renegotiation.
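
For anyone who wants to gather a similar trace: the excerpt below looks like output from probes on the TX path, and a rough bpftrace sketch along these lines should produce something comparable (this is a sketch under assumptions, not our exact script; it assumes a BTF-enabled kernel and the mlx5 driver):

#!/usr/bin/env bpftrace
/*
 * Log which netdev an LACPDU is queued on and which device it reaches in
 * mlx5e_xmit(). 0x0988 is EtherType 0x8809 (Slow Protocols / LACP) read as
 * a host-order u16 on a little-endian machine. On newer kernels
 * dev_queue_xmit() is inlined; probe __dev_queue_xmit or the
 * net:net_dev_queue tracepoint instead.
 */
kprobe:dev_queue_xmit
{
    $skb = (struct sk_buff *)arg0;
    if ($skb->protocol == 0x0988) {
        printf("%s dev_queue_xmit: iface=%s len=%d\n",
            strftime("%H:%M:%S", nsecs), str($skb->dev->name), $skb->len);
    }
}

kprobe:mlx5e_xmit
{
    $skb = (struct sk_buff *)arg0;
    $dev = (struct net_device *)arg1;    /* mlx5e_xmit(skb, dev) */
    if ($skb->protocol == 0x0988) {
        printf("%s mlx5e_xmit: iface=%s\n",
            strftime("%H:%M:%S", nsecs), str($dev->name));
    }
}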

[LACP PACKET DETECTED] skb=0xff2181e29aabf700
17:57:12.%f dev_queue_xmit: iface=ens3f0 len=124
  Source MAC: 88:xx:xx:xx:58:5c
  Actor Port: 1
  Actor State: 0x3d [Active Agg Sync Collect Distrib ]
17:57:12.%f dev_hard_start_xmit: +25 us
17:57:12.%f mlx5e_xmit: iface=ens3f0 sq=0xff2181e222138000 +28 us
17:57:12.%f   -> txwqe_build_eseg: building ethernet segment
17:57:12.%f   -> mlx5e_tx_mpwqe_session_start: MPWQE operation
17:57:12.%f   -> mlx5e_tx_mpwqe_session_complete: MPWQE operation
17:57:12.%f   -> tx_check_stop: checking if queue should stop
17:57:12.%f mlx5e_xmit returned: 0 [SUCCESS - PACKET SENT TO HW]
17:57:12.%f
[PACKET COMPLETE] skb=0xff2181e29aabf700 freed by napi_consume_skb
  Total time: 45 us (MLX5 time: 14 us)
  Source MAC: 88:xx:xx:xx:58:5c
  Actor Port: 1, State: 0x3d
  Original Interface: ens3f0
  Final Interface: ens3f0
17:57:12.%f   -> TX affinity: mlx5_infer_tx_affinity_mapping
17:57:12.%f   -> TX affinity: mlx5_infer_tx_enabled
17:57:12.%f   -> TX affinity: mlx5_infer_tx_enabled

[LACP PACKET DETECTED] skb=0xff2181ef5fc9ef00
17:57:13.%f dev_queue_xmit: iface=ens3f1 len=124
  Source MAC: 88:xx:xx:xx:58:5d
  Actor Port: 2
  Actor State: 0x0d [Active Agg Sync ]
17:57:13.%f dev_hard_start_xmit: +26 us
17:57:13.%f mlx5e_xmit: iface=ens3f1 sq=0xff2181e253860000 +34 us
17:57:13.%f   -> txwqe_build_eseg: building ethernet segment
17:57:13.%f   -> mlx5e_tx_mpwqe_session_start: MPWQE operation
17:57:13.%f   -> mlx5e_tx_mpwqe_session_complete: MPWQE operation
17:57:13.%f   -> tx_check_stop: checking if queue should stop
17:57:13.%f mlx5e_xmit returned: 0 [SUCCESS - PACKET SENT TO HW]
17:57:13.%f
[PACKET COMPLETE] skb=0xff2181ef5fc9ef00 freed by napi_consume_skb
  Total time: 60 us (MLX5 time: 23 us)
  Source MAC: 88:xx:xx:xx:58:5d
  Actor Port: 2, State: 0x0d
  Original Interface: ens3f1
  Final Interface: ens3f1
17:57:15.%f   -> TX affinity: mlx5_infer_tx_affinity_mapping
17:57:15.%f   -> TX affinity: mlx5_infer_tx_enabled
17:57:15.%f   -> TX affinity: mlx5_infer_tx_enabled

1. Server side uses a CX6 with 2x 25G ports.
2. Network details:
   1) Non-stacked network architecture; there are no interconnection links between the two Arista TOR switches.
   2) server eth0 -> sw01, server eth1 -> sw02
   3) SW02 receives LACP packets sent from eth0, causing port flapping.
   4) Packet capture on server eth1 does not see the LACP packets sent by eth0, but they are indeed captured on switch SW02.
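
While the flap is happening, the bonding driver's own view of the LACP actor/partner state can also be checked from the host (bond0 and the eth names below are placeholders for our actual interfaces):

cat /proc/net/bonding/bond0    # per-member actor/partner details (and churn counters on newer kernels)
ip -d link show eth0           # shows ad_actor_oper_port_state for the slave
ip -d link show eth1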

Currently, we don’t know how to resolve this issue.

NVIDIA TAC refused to look into our issue after asking us to change cables/switches to meet their prerequisite. They are now asking us to reach out to the server vendor who sold the card. :-(

Will update if we are able to root cause this problem.