ConnectX-6 Dx: rx_discards_phy limits XDP redirect to ~100Mpps at 64-byte line rate (148Mpps)

Environment

  • NIC: ConnectX-6 Dx Dual Port 100GbE (0F6FXM_08P2T2_Ax)
  • Firmware: 22.46.3048
  • OFED: 23.10-3.2.2
  • CPU: AMD EPYC 9754 128-Core (single socket, 1 NUMA node)
  • Kernel: 6.1.0-43-amd64 (Debian)
  • PCIe: Gen4 x16, MaxPayload 512, MaxReadReq 4096

Problem

We are running an XDP program that redirects all packets between two interfaces (separate NICs in separate PCIe slots). The traffic generator sends 64-byte packets at 100G line rate (~148 Mpps).
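For context, the ~148 Mpps figure follows directly from Ethernet framing overhead: each 64-byte frame occupies 64 + 8 (preamble/SFD) + 12 (inter-frame gap) = 84 bytes on the wire. A quick sanity check:

```python
# Minimum-size Ethernet frame on the wire:
# 64B frame + 8B preamble/SFD + 12B inter-frame gap = 84B
WIRE_BYTES = 64 + 8 + 12
LINK_BPS = 100e9  # 100GbE

pps = LINK_BPS / (WIRE_BYTES * 8)
print(f"{pps / 1e6:.1f} Mpps")  # → 148.8 Mpps theoretical line rate at 64B
```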

The NIC receives all packets on the wire (rx_packets_phy = 148M/s) but delivers only ~80-100 Mpps to the host. The rest are dropped in NIC hardware: rx_discards_phy increments at 45-68M/s.

Additional observations:

  • rx_out_of_buffer = 0
  • CPU utilization is only 7%
  • Zero TX errors on the egress NIC (tx_xdp_full = 0, tx_xdp_err = 0)

We benchmarked different combined channel counts while keeping everything else constant:

Queue Count Scaling (key finding):

  Queues   TX (Mpps)   Discards (Mpps)   Forwarded
      16        42.5              48.9       28.6%
      32        81.5              32.3       54.8%
      48       100.4              42.6       67.5%
      64        93.9              54.2       63.1%
      96        85.3              63.6       57.3%
     127        81.0              67.9       54.4%

Performance peaks at 48 queues and degrades with more.
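The "forwarded" column is TX divided by the ~148.8 Mpps offered load. Reproducing it (using the measurements from the table above) also makes the per-queue rate visible, which drops steadily as queues are added:

```python
# TX measurements (Mpps) per combined-channel count, taken from the table above.
OFFERED_MPPS = 148.8  # 64B line rate at 100G
results = {16: 42.5, 32: 81.5, 48: 100.4, 64: 93.9, 96: 85.3, 127: 81.0}

for queues, tx_mpps in results.items():
    print(f"{queues:>3} queues: {tx_mpps / OFFERED_MPPS:6.1%} of line rate, "
          f"{tx_mpps / queues:5.2f} Mpps/queue")
```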

What we’ve tried (no significant improvement)

  • Interrupt coalescing: Tested adaptive on/off, rx-usecs 3-128, rx-frames 32-512 — no change
  • NAPI tuning: napi_defer_hard_irqs up to 50, gro_flush_timeout up to 200µs — no change at 127 queues
  • CQE compression: CQE_COMPRESSION=AGGRESSIVE (firmware) + rx_cqe_compress on (driver) — marginal improvement
  • PCIe relaxed ordering: PCI_WR_ORDERING=force_relax — no change
  • Virtual lanes: NUM_OF_VL_P1=1 (reduced from 4) — no change
  • MaxReadReq: Increased to 4096 — no change
  • Driver private flags: tx_cqe_compress on, xdp_tx_mpwqe on, tx-push on — ~2% improvement

Questions

  1. Is ~100 Mpps the expected maximum XDP redirect throughput for ConnectX-6 Dx with 64-byte packets? What is the NIC’s rated small-packet forwarding capacity?
  2. What does rx_discards_phy incrementing with rx_out_of_buffer=0 indicate? Is this an internal port buffer overflow or a scheduling/arbitration limit?
  3. Are there firmware parameters or NIC configuration options we haven’t explored that could increase the packet delivery rate?
  4. Would upgrading to OFED 24.10 or newer firmware improve small-packet XDP performance?
  5. Is ConnectX-7 expected to have a higher internal pps ceiling?

Any guidance on maximizing small-packet XDP redirect throughput would be greatly appreciated.

At 64-byte packets and 100G, you are pushing ~148 Mpps at the RX port. The NIC sees all of them on the wire (rx_packets_phy), but only ~80–100 Mpps can pass through the host-based XDP redirect path; the rest are dropped in hardware and counted in rx_discards_phy.

Per DOCA Telemetry, rx_discards_phy counts packets dropped on the physical port due to lack of buffers (adapter congestion), and it is independent of rx_out_of_buffer (host RX WQE exhaustion). With rx_out_of_buffer = 0, your RX rings are sized adequately; the drop happens earlier, because the ingress and host-facing pipeline are saturated in packets-per-second terms.

1/ Expected PPS / rating:

There is no published small-packet XDP redirect PPS guarantee for ConnectX-6 Dx. The ~80–100 Mpps you see for 64-byte XDP redirect is in line with what we expect from the host-based path at 100G; achieving the full ~148 Mpps at 64B typically requires a hardware offload path (e.g. eSwitch/ASAP²), not pure XDP redirect.

2/ Meaning of rx_discards_phy with rx_out_of_buffer=0:

This indicates congestion at the adapter’s physical port (lack of buffers) rather than a shortage of RX WQEs on the host queues. In other words, the adapter is oversubscribed by the 148 Mpps stream relative to what it can move to the host/XDP pipeline.
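A back-of-envelope calculation (my numbers, not measurements: PCIe Gen4 x16 is roughly 31.5 GB/s raw per direction, less after TLP/descriptor overhead) shows why the limit is transactions per second rather than bandwidth: the payload bandwidth at line rate is modest, but the per-packet time budget is only a few nanoseconds.

```python
# Sketch: at 64B line rate, bandwidth is not the constraint; per-packet
# transaction rate is. PCIe Gen4 x16 raw bandwidth (~31.5 GB/s/direction)
# is an assumption from the spec, not a measured value.
PPS = 148.8e6
PKT_BYTES = 64

data_gbs = PPS * PKT_BYTES / 1e9   # payload DMA bandwidth required
ns_per_pkt = 1e9 / PPS             # time budget per packet

print(f"{data_gbs:.1f} GB/s payload vs ~31.5 GB/s raw PCIe Gen4 x16")  # → 9.5 GB/s
print(f"{ns_per_pkt:.1f} ns budget per packet")                        # → 6.7 ns
```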

3/ Firmware / NIC tuning:

The main NIC-side XDP optimizations are features like XDP inline transmission of small packets and multi-packet TX WQEs, which you already have enabled via xdp_tx_mpwqe, CQE compression, etc. There are no documented firmware “hidden knobs” that raise an internal PPS limit beyond that; further gains come from CPU affinity, NAPI budgeting, and queue layout and are usually incremental.

4/ OFED 24.10 / newer firmware:

Running the latest MLNX_OFED/MLNX_EN and firmware is recommended and can provide incremental XDP improvements, but based on current public documentation we do not expect it alone to bridge the entire gap from ~100 Mpps to full 148 Mpps 64-byte redirect.

5/ ConnectX-7:

CX7 generally offers higher performance and more headroom, but there is no official XDP redirect PPS specification for it either. Host-based XDP redirect remains constrained by the host I/O and processing pipeline, so CX7 may help, but it is not a guaranteed way to reach line-rate 148 Mpps for this workload.