8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8

Toshi.A · June 12, 2026, 10:57am

Sharing real-world results from running 8 GB10 nodes (4x ASUS Ascent GX10 + 4x Lenovo ThinkStation PGX) on a single MikroTik CRS812.

✅ A single CRS812-8DS-2DQ-2DDQ with two 400DD→4x100G breakout cables can host an 8-node DGX Spark cluster
✅ 200G vs 100G: per-stream decode speed is essentially unchanged (TTFT increases slightly when warm, noticeably on cold prefill)
🚀 Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8 was faster than expected
Bonus question at the end: looking for GX10 stability tips

Hardware

Compute: 8x GB10 (SoC sm_121a, 128GB UMA each, ~1TB total)
- ASUS Ascent GX10 × 4
- Lenovo ThinkStation PGX × 4
Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.23)
Cabling: 2x 400DD→4x100G breakout cables (switch side: 1x 400G QSFP-DD → node side: 4x 100G QSFP28). Two cables × 4 nodes = all 8 nodes at 100G from just the two QSFP-DD ports
Driver: NVIDIA 580.159 on all 8 nodes (apt-mark hold)
OS: Ubuntu 24.04 LTS / DGX OS, kernel 6.17.0-1021-nvidia

The CRS812 only has two 400G QSFP-DD ports, but the breakout approach lets one switch absorb the entire 8-node cluster at 100G per node.

Network: 100G vs 200G (Measured)

Same TP=4 inference workload (Qwen3.5 397B-A17B int4-AutoRound, vLLM 0.22 + no-ray + 30 GiB KV) measured under both link configurations.

Benchmark: 5 iterations × 4 prompt sizes {8K, 16K, 64K, 128K} × n=4 concurrency, max_tokens=500, thinking off, direct endpoint (no proxy).

Per-stream decode tps (single stream)

Size	All 200G	All 100G	Δ
8K	25.21	24.78	-1.7%
16K	25.78	25.48	-1.2%
64K	25.08	24.64	-1.8%
128K	23.48	24.20	+3.1% (noise)

Per-stream decode is essentially independent of link bandwidth (±3%). Qwen3.5-397B INT4 TP=4 decode is LPDDR5X UMA memory-bandwidth bound; NCCL all-reduce link bandwidth is not the bottleneck.
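For anyone who wants to re-derive the Δ column, here is a minimal sketch that recomputes it from the table values above. The function name and dictionary layout are illustrative only, not part of the benchmark harness.

```python
# Per-stream decode throughput (tokens/s) copied from the table above:
# size -> (all-200G, all-100G)
decode_tps = {
    "8K": (25.21, 24.78),
    "16K": (25.78, 25.48),
    "64K": (25.08, 24.64),
    "128K": (23.48, 24.20),
}


def relative_delta_pct(baseline: float, measured: float) -> float:
    """Percent change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline * 100.0


deltas = {
    size: relative_delta_pct(g200, g100)
    for size, (g200, g100) in decode_tps.items()
}

for size, delta in deltas.items():
    print(f"{size:>4}: {delta:+.1f}%")

# Largest swing in either direction across all prompt sizes.
worst = max(abs(d) for d in deltas.values())
print(f"largest swing: {worst:.1f}%")
```

The largest swing is the 128K row, and it points in the opposite direction from the other three, which is why it is labeled as noise.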

Aggregate throughput (n=4 concurrent, warm)

Size	All 200G	All 100G	Δ
8K	68.7 tps	53.6 tps	-22.0%
16K	78.6 tps	64.1 tps	-18.5%
64K	77.0 tps	73.1 tps	-5.0%
128K	80.8 tps	80.6 tps	-0.2%

Aggregate throughput drops ~20% at short contexts (8K–16K), but the gap nearly disappears at 64K–128K.

TTFT (warm, prefix-cache hit)

Size	All 200G	All 100G	Δ
8K	2.02s	4.15s	+106%
16K	1.69s	3.12s	+85%
64K	1.03s	1.60s	+56%
128K	0.86s	1.11s	+29%

The TTFT multiplier is larger at short contexts, but warm TTFT stays within roughly 1 to 4 seconds either way, so the difference is minor in practice.
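To put the percentages in absolute terms, this small sketch computes the extra seconds added by the 100G configuration at each prompt size, using only the warm TTFT values from the table above. Variable names are illustrative.

```python
# Warm TTFT in seconds, copied from the table above:
# size -> (all-200G, all-100G)
ttft_s = {
    "8K": (2.02, 4.15),
    "16K": (1.69, 3.12),
    "64K": (1.03, 1.60),
    "128K": (0.86, 1.11),
}

# Absolute penalty of the 100G links, in seconds.
added_s = {size: t100 - t200 for size, (t200, t100) in ttft_s.items()}
# Same penalty expressed as a multiplier.
ratio = {size: t100 / t200 for size, (t200, t100) in ttft_s.items()}

for size in ttft_s:
    print(f"{size:>4}: +{added_s[size]:.2f} s ({ratio[size]:.2f}x)")
```

The worst case is the 8K row, at a little over two extra seconds. The penalty shrinks to a quarter of a second at 128K.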

Conclusion

From a production inference standpoint: decode tps is link-independent (memory bound), warm TTFT differences are small in absolute terms, and only cold prefill on large contexts is significantly affected. Collapsing all 8 nodes onto one switch at 100G is an acceptable trade-off for production.

⚠ Caveat: early in the project I hit a ConnectX-7 PCIe Power Throttle stuck state — after a cable hot-swap, inter-node bandwidth got stuck at ~13 Gbit/s until a host reboot. Worth checking if your inter-node bandwidth looks wrong.

Inference: Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8

Engine: scitrera/dgx-spark-sglang:0.5.12, key settings:

--tp-size 8 --pp-size 1
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.85
--attention-backend flashinfer
--moe-runner-backend flashinfer_cutlass
--max-mamba-cache-size 96
--cuda-graph-max-bs 8
--disable-piecewise-cuda-graph

NCCL_IB_HCA=rocep1s0f1   # specifying both RoCE lanes fails at startup (ibv_modify_qp error)
SGLANG_ENABLE_DEEP_GEMM=0   # --disable-deep-gemm flag not implemented in 0.5.12
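A rough sketch of how the settings above could be assembled into a per-node launch line. The server flags and the two environment variables are copied from the post. Everything else is an assumption for illustration and may differ from what the container image actually runs: the `python3 -m sglang.launch_server` entry point, the multi-node arguments (`--nnodes`, `--node-rank`, `--dist-init-addr`), the model path, and the head-node address.

```python
import shlex

# Server flags copied verbatim from the settings listed above.
SERVER_FLAGS = [
    "--tp-size", "8", "--pp-size", "1",
    "--quantization", "modelopt_fp4",
    "--kv-cache-dtype", "fp8_e4m3",
    "--mem-fraction-static", "0.85",
    "--attention-backend", "flashinfer",
    "--moe-runner-backend", "flashinfer_cutlass",
    "--max-mamba-cache-size", "96",
    "--cuda-graph-max-bs", "8",
    "--disable-piecewise-cuda-graph",
]

# Environment variables copied from the post.
ENV = {"NCCL_IB_HCA": "rocep1s0f1", "SGLANG_ENABLE_DEEP_GEMM": "0"}


def build_launch(model_path: str, node_rank: int, head_addr: str, nnodes: int = 8) -> str:
    """Return a shell command line for one node (hypothetical wrapper)."""
    argv = [
        "python3", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,               # hypothetical path
        "--nnodes", str(nnodes),                  # assumed multi-node args
        "--node-rank", str(node_rank),
        "--dist-init-addr", head_addr,
    ] + SERVER_FLAGS
    env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in ENV.items())
    return f"{env_prefix} {shlex.join(argv)}"


# Hypothetical values, for illustration only.
cmd = build_launch("/models/nemotron-3-ultra-nvfp4", node_rank=0, head_addr="10.0.0.1:50000")
print(cmd)
```

Each node would run the same line with its own `node_rank`. Keeping the flags in one list makes it harder for the eight nodes to drift apart.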

Startup

NCCL init (8-node ring+tree): ~5 s
Weight load (113 safetensors shards): ~9 min
KV cache: 17.6M tokens (50.4 GB cluster-wide), Mamba cache: 96 slots
Total time to READY: ~10 min
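As a quick sanity check on the KV cache line, the reported capacity can be turned into a per-token footprint. This sketch assumes "GB" means 10^9 bytes and that the cluster-wide figure splits evenly across the 8 nodes. Neither is stated in the post.

```python
KV_TOKENS = 17.6e6   # reported KV cache capacity in tokens
KV_BYTES = 50.4e9    # reported cluster-wide size, assuming decimal GB
NODES = 8

# Average KV footprint per cached token across the whole cluster.
bytes_per_token = KV_BYTES / KV_TOKENS

# Per-node share, assuming an even split under TP=8.
per_node_gb = KV_BYTES / NODES / 1e9

print(f"~{bytes_per_token:.0f} bytes per token cluster-wide")
print(f"~{per_node_gb:.1f} GB of KV cache per node")
```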

n=1 baseline (no MTP, prefix cache enabled)

Size	TTFT p50	TPOT p50	Decode p50
8K	5.5 s	73.9 ms	13.5 tps
16K	6.1 s	74.4 ms	13.5 tps
32K	6.5 s	74.3 ms	13.5 tps
64K	6.2 s	75.0 ms	13.3 tps
128K	4.8 s	76.0 ms	13.2 tps

⚠ Note on the flat TTFT: this is not true cold-prefill performance. The benchmark slices prompts of different sizes from the same source corpus, so larger prompts contain smaller ones as prefixes — SGLang’s radix cache hit range grows with prompt size, flattening TTFT (sometimes even making larger sizes faster).

True cold prefill (measured with radix cache disabled, see MTP section below) scales near-linearly: TTFT goes 5.8s → 46.7s from 8K → 64K at a stable ~1,380 tok/s prefill throughput. NemotronH’s hybrid architecture (Mamba-dominant, sparse attention) keeps this O(n)-ish rather than a Transformer’s O(n²), but prefill cost still grows with size.

Production takeaway: workloads that reuse a common system prompt (chatbots, agent loops) get seconds-range TTFT from the prefix cache; workloads with unique long prompts (RAG, batch) should budget against the 1,380 tok/s cold-prefill figure.
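A back-of-the-envelope way to budget against the ~1,380 tok/s cold-prefill figure: divide the prompt length by the rate. This sketch assumes 1K = 1,000 tokens and a constant rate. The 128K row is an extrapolation, not a measurement.

```python
COLD_PREFILL_TOK_PER_S = 1380.0  # cold-prefill throughput quoted above


def cold_ttft_seconds(prompt_tokens: int, rate: float = COLD_PREFILL_TOK_PER_S) -> float:
    """Estimate cold TTFT as prompt length divided by prefill throughput."""
    return prompt_tokens / rate


# Assumes 1K = 1,000 tokens; the benchmark's exact prompt lengths may differ.
for label, tokens in [("8K", 8_000), ("16K", 16_000), ("32K", 32_000),
                      ("64K", 64_000), ("128K", 128_000)]:
    print(f"{label:>4}: ~{cold_ttft_seconds(tokens):.1f} s")
```

The estimates for 8K through 64K land within a few tenths of a second of the measured cold TTFT values, which supports the near-linear scaling described above.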

MTP NEXTN (speculative decoding)

--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4:

Size	TTFT (true cold, radix off)	Decode p50	vs baseline
8K	5.8 s	29.6 tps	2.19×
16K	11.7 s	29.4 tps	2.19×
32K	23.1 s	29.0 tps	2.16×
64K	46.7 s	29.0 tps	2.17×
128K	n/a — a GX10 node crashed mid-run ⚠

Consistent 2.16–2.19× decode speedup with TPOT cut in half (74 ms → 34 ms) across 8K–64K.
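The speedup and TPOT figures follow directly from the decode rates in the two tables. This sketch recomputes them from the rounded tps values as printed, so the speedups can differ from the table in the second decimal place.

```python
# Decode p50 (tokens/s) from the n=1 baseline and MTP tables above.
baseline_tps = {"8K": 13.5, "16K": 13.5, "32K": 13.5, "64K": 13.3}
mtp_tps = {"8K": 29.6, "16K": 29.4, "32K": 29.0, "64K": 29.0}


def tpot_ms(tps: float) -> float:
    """Time per output token in milliseconds for a given decode rate."""
    return 1000.0 / tps


speedup = {size: mtp_tps[size] / baseline_tps[size] for size in mtp_tps}

for size in mtp_tps:
    print(
        f"{size:>4}: {speedup[size]:.2f}x, "
        f"TPOT {tpot_ms(baseline_tps[size]):.1f} ms -> {tpot_ms(mtp_tps[size]):.1f} ms"
    )
```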

Note: in SGLang 0.5.12, NemotronH + MTP cannot coexist with the radix cache (--disable-radix-cache required), so enabling MTP means losing prefix caching. Trade-off: long-form output / RAG / agentic → MTP on; short-response multi-turn chat with a shared system prompt → MTP off + prefix cache.

Getting a 550B model to ~30 tps per-stream with a 256K context window on an 8-node, individual-budget setup exceeded my expectations.

Bonus: GX10 stability tips wanted

Quick ask: my 4 ASUS Ascent GX10 units frequently go down under sustained inference load (e.g., the 128K cold-prefill runs above). The 4 Lenovo ThinkStation PGX units — same GB10 SoC — run the identical workload without issues, so I suspect something GX10-specific.

Symptoms:

Silent failure during inference: kernel ring buffer freezes → user space dies ~13 minutes later
Eventually progresses to a full power-off state (no ICMP/ARP; physical power-on required)
Neighboring nodes’ journalctl/dmesg show zero events related to the failed node

If anyone has tips for keeping the GX10 stable under this kind of workload — BIOS settings, power management, thermal control, driver options, anything — I’d really appreciate it.

Happy to share full recipes, NCCL flags, SGLang launch scripts, and breakout wiring diagrams if anyone wants to replicate this setup.

aidendle94 · June 14, 2026, 7:23pm

The GX10 has some heat issues. I printed a fan box and it’s stable. Thank you for the numbers. Can you test GLM 5.1?

redacted.design · June 14, 2026, 11:55pm

sudo nvidia-smi -lgc 0,2000

Although it might seem excessive to limit performance on each node, for many this change isn’t that impactful: the clock limit mostly affects PP (prompt processing), and memory is always the real bottleneck. Anyone doing inference on Nvidia gear is always turning down the power for less heat, less trouble.

You should also look up Henry’s “Spice Harvester” 3d-printed cooling setup.

I’ll add, given that I have the same switch, the darn ConnectX-7 is a MAJOR contributor to heat, even though we only get about 100 Gbps out of it.
