ConnectX‑7 200GbE via MikroTik CRS812 + QSFP‑DD 400G → 2×QSFP56 200G breakout

Hi, I’m building a small 4‑node DGX Spark (GB10) cluster and I’m confused about the
achievable “200GbE” bandwidth.

Setup

  • 4× DGX Spark (GB10), Ubuntu 24.04/DGX OS (kernel 6.14.0-1015-nvidia)
  • Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.20.6), used as L2 bridge
  • Cables: NADDOD Q2Q56-400G-CU2 (QSFP‑DD 400G → 2×200G QSFP56 DAC breakout, 2m). Two
    QSFP‑DD ports on the CRS812 are broken out to connect all 4 DGX nodes.
  • On DGX: ethtool enp1s0f1np1 shows Speed: 200000Mb/s, Lanes: 4, link detected yes. On
    MikroTik the corresponding ports show 200G-baseCR4 and link up.
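For anyone reproducing this, the per-node link state was checked with roughly the following (interface name is from our setup; adjust per node):

```shell
# Verify negotiated speed, lane count, and link state on the CX7 port
ethtool enp1s0f1np1 | grep -E 'Speed|Lanes|Link detected'

# FEC mode negotiated on the link (relevant for the CRS812 side too)
ethtool --show-fec enp1s0f1np1

# Driver and firmware versions, useful when comparing nodes
ethtool -i enp1s0f1np1
```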

What we see

  • iperf3 (v3.16), TCP, pinned to the QSFP interface, MTU 1500:
    • With -P 16 we consistently get ~106 Gbit/s between any two nodes.
    • With MTU 9000 we only saw a small improvement (~111 Gbit/s).
    • Single stream -P 1 is ~30 Gbit/s.
  • NCCL:
    • nccl-tests (2.17.7) with NCCL_SOCKET_IFNAME=enp1s0f1np1 runs, but logs show Using
      network Socket (NET/Socket), not RDMA.
    • 4 nodes, 1 GPU per node:
      • all_gather_perf -b 1G -e 8G -f 2 → Avg bus bandwidth ~2.03 GB/s
      • all_reduce_perf -b 1G -e 4G -f 2 → Avg bus bandwidth ~2.07 GB/s
    • If I try to force NCCL_NET=IB, NCCL fails with “Failed to initialize NET plugin
      IB”.
  • Also, each physical CX7 port shows up as enp… and enP2p… (multi-host style). I saw in
    another thread that GB10 may have PCIe x4 limitations and might require bonding/
    aggregation of the “two halves”.

Questions

  1. Is ~100Gbps TCP throughput the expected ceiling on DGX Spark even when the link is
    negotiated at 200GbE? (PCIe x4 / multi-host limitation?)
  2. If full 200Gbps is achievable, what is the recommended configuration? Do we need to
    bond/aggregate enp… with enP2p… for a single physical link, and if so what mode
    (LACP vs balance-xor) is supported/recommended?
  3. For NCCL on Spark, what’s the expected best practice to validate network bandwidth?
    Should NCCL be using RoCE/verbs/IB (and which packages/plugins are required), or is
    NET/Socket expected? What “good” numbers should we expect on 200GbE?
  4. Any MikroTik CRS812 switch settings that are known to affect this (MTU, flow
    control, FEC, bridge HW offload, etc.)?
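On question 1, one quick sanity check is the negotiated PCIe link behind the NIC. A hedged sketch (the bus address 01:00.0 is an example; find yours with `lspci | grep -i mellanox`):

```shell
# Compare LnkCap (what the device supports) vs LnkSta (what was negotiated)
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

# Back-of-envelope: a Gen5 x4 link is 4 x 32 GT/s ~ 128 Gb/s raw,
# which would be consistent with a ~100 Gb/s TCP ceiling per half
```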

I can provide full outputs (ethtool -i, lspci -vv, ibdev2netdev, MikroTik port monitor,
iperf3 JSON, NCCL logs) if needed.

I wrote a benchmark tool to check configs. I only have two Sparks, but using a single cable with NCCL I get bidirectional bandwidth of around 45 GB/s, i.e. a unidirectional “180 GbE equivalent”.

OrthoSystemDDx-aarch64.zip (31.7 MB)

It will open a lobby where you enter the IP addresses of the nodes. The tool relies on passwordless SSH to copy (SCP) the benchmark to the nodes and run it.
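For the passwordless SSH the tool needs, a minimal one-time setup could look like this (user and node IPs are placeholders):

```shell
# Generate a key once on the driver node (no passphrase for unattended runs)
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519

# Push the public key to every node in the cluster
for node in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
  ssh-copy-id -i ~/.ssh/id_ed25519.pub "$USER@$node"
done
```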

There are also experiments running 4, 6, and 8 Sparks on the same switch, if you want to compare your numbers:

Thanks for sharing the benchmark tool!

We were trying to validate full 200G with 4 DGX Spark nodes on a MikroTik CRS812 using a QSFP-
DD → 2x200G breakout cable. What finally made things “click” was treating the single physical
port as two logical halves (enp1s0f1np1 + enP2p1s0f1np1 / rocep1s0f1 + roceP2p1s0f1) and
driving traffic on both concurrently.

After setting MTU 9000 end-to-end and disabling IPv6 on the CX7 interfaces (to keep RoCE GID
indices consistent), we now see:

  • iperf3: ~196–198 Gb/s aggregate per node-pair (two parallel sessions, one per half)
  • NCCL (4 nodes, all_reduce 256MiB, NET=IB with RDMA plugin): Avg bus bandwidth ~23.76 GB/s
    (~190 Gb/s class)
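For reference, the “two parallel sessions, one per half” measurement was along these lines (IP addresses are illustrative; we put each logical half on its own subnet):

```shell
# On the receiving node: one iperf3 server per logical half
iperf3 -s -B 10.0.1.2 -p 5201 &   # half 1 (enp1s0f1np1)
iperf3 -s -B 10.0.2.2 -p 5202 &   # half 2 (enP2p1s0f1np1)

# On the sending node: drive both halves concurrently,
# multiple streams per half to saturate each ~100G path
iperf3 -c 10.0.1.2 -p 5201 -P 8 -t 30 &
iperf3 -c 10.0.2.2 -p 5202 -P 8 -t 30 &
wait   # sum the two reported throughputs for the aggregate
```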

Single-flow iperf still tops out around ~100G, which matches the “2×100G halves” behavior. Also
worth noting: we had a leftover /etc/nccl.conf that was forcing Socket + NCCL_IB_DISABLE=1;
removing that fixed NET/IB initialization.
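A minimal sketch of the checks and the run that got NET/IB working, assuming nccl-tests built with MPI and the rocep… device names from our nodes (adjust to your system):

```shell
# Make sure no stale config forces the Socket transport
cat /etc/nccl.conf 2>/dev/null   # should NOT contain NCCL_IB_DISABLE=1

# Run with explicit HCA selection and debug logging to confirm NET/IB
NCCL_DEBUG=INFO \
NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 \
mpirun -np 4 -H node1,node2,node3,node4 \
  ./all_reduce_perf -b 256M -e 256M -g 1
```

The `NCCL_DEBUG=INFO` output should show `Using network IB` instead of `NET/Socket` once the stale config is gone.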

Appreciate the AppImage — we’ll try it on the nodes as well.

Thanks — these vLLM serving benchmarks are super helpful. Great to see 4-node TP still scaling
well on CRS812 DDQ + 400G → 2×200G breakouts.

For reference, we have 4x DGX Spark on the same CRS812-DDQ with a QSFP-DD → 2x200G breakout
cable. After enabling MTU 9000 end-to-end and assigning IPs to both logical halves of the port
(enp1s0f1np1 + enP2p1s0f1np1), we measure ~196–198 Gb/s aggregate with iperf3 between every
node pair (two parallel sessions, one per half), and NCCL all_reduce (4 nodes, 256MiB, NET=IB
with RDMA plugin) gives Avg busbw 23.76 GB/s.
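For completeness, the per-interface setup amounts to something like the following sketch (the IP plan is illustrative; disabling IPv6 on these interfaces is what we did to keep the RoCE GID indices stable):

```shell
# Jumbo frames and IPv6 off on both logical halves of the CX7 port
for ifc in enp1s0f1np1 enP2p1s0f1np1; do
  sudo ip link set dev "$ifc" mtu 9000
  sudo sysctl -w "net.ipv6.conf.${ifc}.disable_ipv6=1"
done

# One subnet per half so traffic can be steered onto each ~100G path
sudo ip addr add 10.0.1.1/24 dev enp1s0f1np1
sudo ip addr add 10.0.2.1/24 dev enP2p1s0f1np1
```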

Curious: are you also using both halves (enp1 + enP2p) and jumbo frames for the 200G aggregate?


It’s getting more and more interesting now. Appreciate these “expensive” experiments.