Hi, I’m building a small 4‑node DGX Spark (GB10) cluster and I’m confused about the
achievable “200GbE” bandwidth.
Setup
4× DGX Spark (GB10), Ubuntu 24.04/DGX OS (kernel 6.14.0-1015-nvidia)
Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.20.6), used as L2 bridge
Cables: NADDOD Q2Q56-400G-CU2 (QSFP‑DD 400G → 2×200G QSFP56 DAC breakout, 2m). Two
QSFP‑DD ports on the CRS812 are broken out to connect all 4 DGX nodes.
On DGX: ethtool enp1s0f1np1 shows Speed: 200000Mb/s, Lanes: 4, link detected yes. On
MikroTik the corresponding ports show 200G-baseCR4 and link up.
What we see
iperf3 (v3.16), TCP, pinned to the QSFP interface, MTU 1500:
With -P 16 we consistently get ~106 Gbit/s between any two nodes.
With MTU 9000 we only saw a small improvement (~111 Gbit/s).
Single stream -P 1 is ~30 Gbit/s.
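For reproducibility, the runs were along these lines (addresses are hypothetical; -B binds each end to the QSFP interface’s address so traffic can’t leak onto the management NIC):

```shell
# Server node (hypothetical QSFP address 10.10.0.1 on enp1s0f1np1):
iperf3 -s -B 10.10.0.1

# Client node: 16 parallel TCP streams, 30 s, bound to the local QSFP address.
iperf3 -c 10.10.0.1 -B 10.10.0.2 -P 16 -t 30

# Single-stream case for comparison:
iperf3 -c 10.10.0.1 -B 10.10.0.2 -P 1 -t 30
```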
NCCL:
nccl-tests (2.17.7) with NCCL_SOCKET_IFNAME=enp1s0f1np1 runs, but logs show Using
network Socket (NET/Socket), not RDMA.
If I try to force NCCL_NET=IB, NCCL fails with “Failed to initialize NET plugin
IB”.
Also, each physical CX7 port shows up as two interfaces, enp… and enP2p… (multi-host style). I saw in
another thread that GB10 may have a PCIe x4 limitation per half and might require bonding/
aggregating the “two halves” to reach full bandwidth.
Questions
Is ~100 Gb/s TCP throughput the expected ceiling on DGX Spark even when the link
negotiates at 200GbE (a PCIe x4 / multi-host limitation)?
If full 200Gbps is achievable, what is the recommended configuration? Do we need to
bond/aggregate enp… with enP2p… for a single physical link, and if so what mode
(LACP vs balance-xor) is supported/recommended?
For NCCL on Spark, what’s the expected best practice to validate network bandwidth?
Should NCCL be using RoCE/verbs/IB (and which packages/plugins are required), or is
NET/Socket expected? What “good” numbers should we expect on 200GbE?
Any MikroTik CRS812 switch settings that are known to affect this (MTU, flow
control, FEC, bridge HW offload, etc.)?
I can provide full outputs (ethtool -i, lspci -vv, ibdev2netdev, MikroTik port monitor,
iperf3 JSON, NCCL logs) if needed.
I wrote a benchmark tool to check configs. I only have two Sparks, but over a single cable NCCL gives me bidirectional bandwidth in the range of 45 GB/s, i.e. roughly a unidirectional “180 GbE equivalent”.
We were trying to validate full 200G with 4 DGX Spark nodes on a MikroTik CRS812 using a QSFP-
DD → 2x200G breakout cable. What finally made things “click” was treating the single physical
port as two logical halves (enp1s0f1np1 + enP2p1s0f1np1 / rocep1s0f1 + roceP2p1s0f1) and
driving traffic on both concurrently.
After setting MTU 9000 end-to-end and disabling IPv6 on the CX7 interfaces (to keep RoCE GID
indices consistent), we now see:
iperf3: ~196–198 Gb/s aggregate per node-pair (two parallel sessions, one per half)
NCCL (4 nodes, all_reduce 256MiB, NET=IB with RDMA plugin): Avg bus bandwidth ~23.76 GB/s
(~190 Gb/s class)
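Concretely, the dual-session setup looks roughly like this; the subnets and ports are hypothetical, the interface names are as reported on our nodes, and each half gets its own subnet so routing keeps the flows separated:

```shell
# Jumbo frames and one /24 per logical half (addresses hypothetical).
sudo ip link set enp1s0f1np1   mtu 9000
sudo ip link set enP2p1s0f1np1 mtu 9000
sudo ip addr add 10.10.1.2/24 dev enp1s0f1np1
sudo ip addr add 10.10.2.2/24 dev enP2p1s0f1np1

# Peer runs two iperf3 servers on distinct ports (-p 5201 / -p 5202).
# Then drive both halves concurrently and sum the results:
iperf3 -c 10.10.1.1 -p 5201 -P 8 -t 30 &
iperf3 -c 10.10.2.1 -p 5202 -P 8 -t 30 &
wait
```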
Single-flow iperf3 still tops out around ~100 Gb/s, which matches the “2×100G halves” behavior. Also
worth noting: we had a leftover /etc/nccl.conf forcing Socket transport with NCCL_IB_DISABLE=1;
removing it fixed NET/IB initialization.
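In case anyone else hits the same “Failed to initialize NET plugin IB” error, the cleanup amounted to the following sketch; the env-var names are standard NCCL knobs, and the HCA/interface lists match our device names (adjust to yours):

```shell
# Remove the stale config that forced Socket transport and disabled IB/RoCE.
sudo rm /etc/nccl.conf

# Select the verbs/RDMA path explicitly and hand NCCL both halves of the port.
export NCCL_NET=IB
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_SOCKET_IFNAME=enp1s0f1np1,enP2p1s0f1np1
export NCCL_DEBUG=INFO   # init log should now report "Using network IB"
```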
Appreciate the AppImage — we’ll try it on the nodes as well.
Thanks — these vLLM serving benchmarks are super helpful. Great to see 4-node TP still scaling
well on CRS812 DDQ + 400->2x200G breakouts.
For reference, we have 4x DGX Spark on the same CRS812-DDQ with a QSFP-DD → 2x200G breakout
cable. After enabling MTU 9000 end-to-end and assigning IPs to both logical halves of the port
(enp1s0f1np1 + enP2p1s0f1np1), we measure ~196–198 Gb/s aggregate with iperf3 between every
node pair (two parallel sessions, one per half), and NCCL all_reduce (4 nodes, 256MiB, NET=IB
with RDMA plugin) gives Avg busbw 23.76 GB/s.
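For anyone double-checking the units: NCCL reports bus bandwidth in gigabytes per second, so converting to line-rate Gb/s is just a factor of 8 (ignoring protocol overhead):

```shell
# 23.76 GB/s (NCCL avg busbw) -> Gb/s on the wire
awk 'BEGIN { printf "%.2f Gb/s\n", 23.76 * 8 }'
# prints: 190.08 Gb/s
```

That is ~190 of the nominal 200 Gb/s, consistent with the ~196–198 Gb/s iperf3 aggregate.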
Curious: are you also using both halves (enp1 + enP2p) and jumbo frames for the 200G aggregate?