ConnectX‑7 200GbE via MikroTik CRS812 + QSFP‑DD 400G → 2×QSFP56 200G breakout

Hi, I’m building a small 4‑node DGX Spark (GB10) cluster and I’m confused about the
achievable “200GbE” bandwidth.

Setup

  • 4× DGX Spark (GB10), Ubuntu 24.04/DGX OS (kernel 6.14.0-1015-nvidia)
  • Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.20.6), used as L2 bridge
  • Cables: NADDOD Q2Q56-400G-CU2 (QSFP‑DD 400G → 2×200G QSFP56 DAC breakout, 2m). Two
    QSFP‑DD ports on the CRS812 are broken out to connect all 4 DGX nodes.
  • On DGX: ethtool enp1s0f1np1 shows Speed: 200000Mb/s, Lanes: 4, link detected yes. On
    MikroTik the corresponding ports show 200G-baseCR4 and link up.
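For anyone reproducing this, the per-node link state was checked with roughly the following (interface name is from our setup; adjust per node):

```shell
# Verify negotiated speed, lane count, and link state on the CX7 port
ethtool enp1s0f1np1 | grep -E 'Speed|Lanes|Link detected'

# FEC mode negotiated on the link (relevant for the CRS812 side too)
ethtool --show-fec enp1s0f1np1

# Driver and firmware versions, useful when comparing nodes
ethtool -i enp1s0f1np1
```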

What we see

  • iperf3 (v3.16), TCP, pinned to the QSFP interface, MTU 1500:
    • With -P 16 we consistently get ~106 Gbit/s between any two nodes.
    • With MTU 9000 we only saw a small improvement (~111 Gbit/s).
    • Single stream -P 1 is ~30 Gbit/s.
  • NCCL:
    • nccl-tests (2.17.7) with NCCL_SOCKET_IFNAME=enp1s0f1np1 runs, but logs show Using
      network Socket (NET/Socket), not RDMA.
    • 4 nodes, 1 GPU per node:
      • all_gather_perf -b 1G -e 8G -f 2 → Avg bus bandwidth ~2.03 GB/s
      • all_reduce_perf -b 1G -e 4G -f 2 → Avg bus bandwidth ~2.07 GB/s
    • If I try to force NCCL_NET=IB, NCCL fails with “Failed to initialize NET plugin
      IB”.
  • Also, each physical CX7 port shows up as enp… and enP2p… (multi-host style). I saw in
    another thread that GB10 may have PCIe x4 limitations and might require bonding/
    aggregation of the “two halves”.

Questions

  1. Is ~100Gbps TCP throughput the expected ceiling on DGX Spark even when the link is
    negotiated at 200GbE? (PCIe x4 / multi-host limitation?)
  2. If full 200Gbps is achievable, what is the recommended configuration? Do we need to
    bond/aggregate enp… with enP2p… for a single physical link, and if so what mode
    (LACP vs balance-xor) is supported/recommended?
  3. For NCCL on Spark, what’s the expected best practice to validate network bandwidth?
    Should NCCL be using RoCE/verbs/IB (and which packages/plugins are required), or is
    NET/Socket expected? What “good” numbers should we expect on 200GbE?
  4. Any MikroTik CRS812 switch settings that are known to affect this (MTU, flow
    control, FEC, bridge HW offload, etc.)?
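On question 1, one quick sanity check is the negotiated PCIe link behind the NIC. A hedged sketch (the bus address 01:00.0 is an example; find yours with `lspci | grep -i mellanox`):

```shell
# Compare LnkCap (what the device supports) vs LnkSta (what was negotiated)
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

# Back-of-envelope: a Gen5 x4 link is 4 x 32 GT/s ~ 128 Gb/s raw,
# which would be consistent with a ~100 Gb/s TCP ceiling per half
```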

I can provide full outputs (ethtool -i, lspci -vv, ibdev2netdev, MikroTik port monitor,
iperf3 JSON, NCCL logs) if needed.

I wrote a benchmark tool to check configs. I only have two Sparks, but using a single cable with NCCL I get bidirectional bandwidth of around 45 GB/s, i.e. a unidirectional “180 GbE equivalent”.

OrthoSystemDDx-aarch64.zip (31.7 MB)

It will open a lobby where you enter the IP addresses of the nodes. The tool relies on passwordless SSH to copy (SCP) the benchmark to the nodes and run it.
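For the passwordless SSH the tool needs, a minimal one-time setup could look like this (user and node IPs are placeholders):

```shell
# Generate a key once on the driver node (no passphrase for unattended runs)
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519

# Push the public key to every node in the cluster
for node in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
  ssh-copy-id -i ~/.ssh/id_ed25519.pub "$USER@$node"
done
```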

There are also experiments running 4, 6, and 8 Sparks on the same switch, if you want to compare your numbers:

Thanks for sharing the benchmark tool!

We were trying to validate full 200G with 4 DGX Spark nodes on a MikroTik CRS812 using a QSFP-
DD → 2x200G breakout cable. What finally made things “click” was treating the single physical
port as two logical halves (enp1s0f1np1 + enP2p1s0f1np1 / rocep1s0f1 + roceP2p1s0f1) and
driving traffic on both concurrently.

After setting MTU 9000 end-to-end and disabling IPv6 on the CX7 interfaces (to keep RoCE GID
indices consistent), we now see:

  • iperf3: ~196–198 Gb/s aggregate per node-pair (two parallel sessions, one per half)
  • NCCL (4 nodes, all_reduce 256MiB, NET=IB with RDMA plugin): Avg bus bandwidth ~23.76 GB/s
    (~190 Gb/s class)
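For reference, the “two parallel sessions, one per half” measurement was along these lines (IP addresses are illustrative; we put each logical half on its own subnet):

```shell
# On the receiving node: one iperf3 server per logical half
iperf3 -s -B 10.0.1.2 -p 5201 &   # half 1 (enp1s0f1np1)
iperf3 -s -B 10.0.2.2 -p 5202 &   # half 2 (enP2p1s0f1np1)

# On the sending node: drive both halves concurrently,
# multiple streams per half to saturate each ~100G path
iperf3 -c 10.0.1.2 -p 5201 -P 8 -t 30 &
iperf3 -c 10.0.2.2 -p 5202 -P 8 -t 30 &
wait   # sum the two reported throughputs for the aggregate
```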

Single-flow iperf still tops out around ~100G, which matches the “2×100G halves” behavior. Also
worth noting: we had a leftover /etc/nccl.conf that was forcing Socket + NCCL_IB_DISABLE=1;
removing that fixed NET/IB initialization.
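A minimal sketch of the checks and the run that got NET/IB working, assuming nccl-tests built with MPI and the rocep… device names from our nodes (adjust to your system):

```shell
# Make sure no stale config forces the Socket transport
cat /etc/nccl.conf 2>/dev/null   # should NOT contain NCCL_IB_DISABLE=1

# Run with explicit HCA selection and debug logging to confirm NET/IB
NCCL_DEBUG=INFO \
NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 \
mpirun -np 4 -H node1,node2,node3,node4 \
  ./all_reduce_perf -b 256M -e 256M -g 1
```

The `NCCL_DEBUG=INFO` output should show `Using network IB` instead of `NET/Socket` once the stale config is gone.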

Appreciate the AppImage — we’ll try it on the nodes as well.

Thanks — these vLLM serving benchmarks are super helpful. Great to see 4-node TP still scaling
well on CRS812 DDQ + 400G → 2×200G breakouts.

For reference, we have 4x DGX Spark on the same CRS812-DDQ with a QSFP-DD → 2x200G breakout
cable. After enabling MTU 9000 end-to-end and assigning IPs to both logical halves of the port
(enp1s0f1np1 + enP2p1s0f1np1), we measure ~196–198 Gb/s aggregate with iperf3 between every
node pair (two parallel sessions, one per half), and NCCL all_reduce (4 nodes, 256MiB, NET=IB
with RDMA plugin) gives Avg busbw 23.76 GB/s.
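For completeness, the per-interface setup amounts to something like the following sketch (the IP plan is illustrative; disabling IPv6 on these interfaces is what we did to keep the RoCE GID indices stable):

```shell
# Jumbo frames and IPv6 off on both logical halves of the CX7 port
for ifc in enp1s0f1np1 enP2p1s0f1np1; do
  sudo ip link set dev "$ifc" mtu 9000
  sudo sysctl -w "net.ipv6.conf.${ifc}.disable_ipv6=1"
done

# One subnet per half so traffic can be steered onto each ~100G path
sudo ip addr add 10.0.1.1/24 dev enp1s0f1np1
sudo ip addr add 10.0.2.1/24 dev enP2p1s0f1np1
```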

Curious: are you also using both halves (enp1 + enP2p) and jumbo frames for the 200G aggregate?


It’s getting more and more interesting now. Appreciate these “expensive” experiments.