8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8
Sharing real-world results from running 8 GB10 nodes (4x ASUS Ascent GX10 + 4x Lenovo ThinkStation PGX) on a single MikroTik CRS812.
- ✅ A single CRS812-8DS-2DQ-2DDQ with two 400DD→4x100G breakout cables can host an 8-node DGX Spark cluster
- ✅ 200G vs 100G: per-stream decode speed is essentially unchanged (TTFT increases slightly when warm, noticeably on cold prefill)
- 🚀 Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8 was faster than expected
- Bonus question at the end: looking for GX10 stability tips
Hardware
- Compute: 8x GB10 (SoC sm_121a, 128GB UMA each, ~1TB total)
- ASUS Ascent GX10 × 4
- Lenovo ThinkStation PGX × 4
- Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.23)
- Cabling: 2x 400DD→4x100G breakout cables (switch side: 1x 400G QSFP-DD → node side: 4x 100G QSFP28). Two cables × 4 nodes = all 8 nodes at 100G from just the two QSFP-DD ports
- Driver: NVIDIA 580.159 on all 8 nodes (apt-mark hold)
- OS: Ubuntu 24.04 LTS / DGX OS, kernel 6.17.0-1021-nvidia
The CRS812 only has two 400G QSFP-DD ports, but the breakout approach lets one switch absorb the entire 8-node cluster at 100G per node.
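The port arithmetic behind that claim can be checked in a few lines. This is only a sketch: `BreakoutPlan` and its field names are made up for illustration, and the numbers are the ones from the hardware list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakoutPlan:
    """Hypothetical helper describing a breakout layout on one switch."""
    qsfp_dd_ports: int   # 400G QSFP-DD ports used on the switch
    port_gbps: int       # line rate of each QSFP-DD port
    lanes_per_port: int  # QSFP28 legs on each breakout cable

    def node_links(self) -> int:
        # One node hangs off each breakout leg.
        return self.qsfp_dd_ports * self.lanes_per_port

    def per_node_gbps(self) -> float:
        # Each leg gets an equal share of the port's line rate.
        return self.port_gbps / self.lanes_per_port


plan = BreakoutPlan(qsfp_dd_ports=2, port_gbps=400, lanes_per_port=4)
print(f"{plan.node_links()} node links at {plan.per_node_gbps():.0f}G each")
```

With two ports and four legs per cable this gives exactly eight 100G links, so no port is oversubscribed and none are left over.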
Network: 100G vs 200G (Measured)
Same TP=4 inference workload (Qwen3.5 397B-A17B int4-AutoRound, vLLM 0.22 + no-ray + 30 GiB KV) measured under both link configurations.
Benchmark: 5 iterations × 4 prompt sizes {8K, 16K, 64K, 128K} × n=4 concurrency, max_tokens=500, thinking off, direct endpoint (no proxy).
Per-stream decode tps (single stream)
| Size | All 200G | All 100G | Δ |
|---|---|---|---|
| 8K | 25.21 | 24.78 | -1.7% |
| 16K | 25.78 | 25.48 | -1.2% |
| 64K | 25.08 | 24.64 | -1.8% |
| 128K | 23.48 | 24.20 | +3.1% (noise) |
Per-stream decode is essentially independent of link bandwidth (±3%). Qwen3.5-397B INT4 TP=4 decode is LPDDR5X UMA memory-bandwidth bound; NCCL all-reduce link bandwidth is not the bottleneck.
Aggregate throughput (n=4 concurrent, warm)
| Size | All 200G | All 100G | Δ |
|---|---|---|---|
| 8K | 68.7 tps | 53.6 tps | -22.0% |
| 16K | 78.6 tps | 64.1 tps | -18.5% |
| 64K | 77.0 tps | 73.1 tps | -5.0% |
| 128K | 80.8 tps | 80.6 tps | -0.2% |
Aggregate throughput drops ~20% at short contexts (8K–16K), but the gap nearly disappears at 64K–128K.
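For anyone re-deriving the Δ columns: they are the 100G figure relative to the 200G figure. A minimal sketch follows, with the values copied from the aggregate table; `pct_delta` is just an illustrative helper name. Recomputing from the rounded table values can land about 0.1 points away from the posted Δ, presumably because the posted figures came from unrounded measurements.

```python
def pct_delta(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# (size, all-200G tps, all-100G tps) copied from the aggregate-throughput table
aggregate_tps = [
    ("8K", 68.7, 53.6),
    ("16K", 78.6, 64.1),
    ("64K", 77.0, 73.1),
    ("128K", 80.8, 80.6),
]

for size, tps_200g, tps_100g in aggregate_tps:
    print(f"{size:>4}: {pct_delta(tps_200g, tps_100g):+.1f}%")
```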
TTFT (warm, prefix-cache hit)
| Size | All 200G | All 100G | Δ |
|---|---|---|---|
| 8K | 2.02s | 4.15s | +106% |
| 16K | 1.69s | 3.12s | +85% |
| 64K | 1.03s | 1.60s | +56% |
| 128K | 0.86s | 1.11s | +29% |
The TTFT multiplier is larger at short contexts, but warm TTFT stays in the low single-digit seconds either way (0.86–4.15 s), so the gap is barely perceptible.
Conclusion
From a production inference standpoint: per-stream decode tps is link-independent (memory bound), warm TTFT differences are small in absolute terms, aggregate throughput at n=4 drops about 20% only at short contexts (8K–16K), and cold prefill on large contexts is the one case that is significantly affected. Collapsing all 8 nodes onto one switch at 100G is an acceptable trade-off.
⚠ Caveat: early in the project I hit a ConnectX-7 PCIe Power Throttle stuck state — after a cable hot-swap, inter-node bandwidth got stuck at ~13 Gbit/s until a host reboot. Worth checking if your inter-node bandwidth looks wrong.
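A rough sanity check for that situation could look like the sketch below. Assumptions: the measured figure comes from whatever bandwidth tool you already use (it is passed in by hand here), the 50% threshold is arbitrary, and the function names are made up. The only system interface used is the Linux sysfs attribute `/sys/class/net/<iface>/speed`, which reports the negotiated link speed in Mb/s.

```python
from pathlib import Path
from typing import Optional


def negotiated_gbps(iface: str) -> Optional[float]:
    """Negotiated link speed from sysfs, in Gbit/s; None if it cannot be read."""
    try:
        mbps = int(Path(f"/sys/class/net/{iface}/speed").read_text().strip())
    except (OSError, ValueError):
        return None
    # sysfs reports -1 when the speed is unknown.
    return mbps / 1000.0 if mbps > 0 else None


def looks_throttled(measured_gbps: float, link_gbps: float,
                    min_fraction: float = 0.5) -> bool:
    """Flag a link whose measured throughput is far below its line rate."""
    return measured_gbps < link_gbps * min_fraction


# The ~13 Gbit/s stuck state checked against a 100G link.
print(looks_throttled(13.0, 100.0))
```

A link that negotiates at full speed but measures this low points at something other than the cable or port speed setting, which matches the reboot-only recovery described above.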
Inference: Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8
Engine: scitrera/dgx-spark-sglang:0.5.12, key settings:
--tp-size 8 --pp-size 1
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.85
--attention-backend flashinfer
--moe-runner-backend flashinfer_cutlass
--max-mamba-cache-size 96
--cuda-graph-max-bs 8
--disable-piecewise-cuda-graph
NCCL_IB_HCA=rocep1s0f1 # specifying both RoCE lanes fails at startup (ibv_modify_qp error)
SGLANG_ENABLE_DEEP_GEMM=0 # --disable-deep-gemm flag not implemented in 0.5.12
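For replication, here is one way those settings might be assembled into a per-node launch command. Treat it as a sketch rather than my actual launch script: the model path and head-node address are placeholders, and the multi-node flags (`--nnodes`, `--node-rank`, `--dist-init-addr`) are the standard SGLang ones, not settings quoted above.

```python
import os
import shlex

# Flags copied from the settings listed above.
SERVER_FLAGS = [
    "--tp-size", "8", "--pp-size", "1",
    "--quantization", "modelopt_fp4",
    "--kv-cache-dtype", "fp8_e4m3",
    "--mem-fraction-static", "0.85",
    "--attention-backend", "flashinfer",
    "--moe-runner-backend", "flashinfer_cutlass",
    "--max-mamba-cache-size", "96",
    "--cuda-graph-max-bs", "8",
    "--disable-piecewise-cuda-graph",
]

# Single RoCE device only; DeepGEMM disabled via env var.
ENV_OVERRIDES = {"NCCL_IB_HCA": "rocep1s0f1", "SGLANG_ENABLE_DEEP_GEMM": "0"}


def build_launch(model_path: str, node_rank: int, head_addr: str, nnodes: int = 8):
    """Return (argv, env) for one node; run with subprocess.run(argv, env=env)."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", head_addr,
        *SERVER_FLAGS,
    ]
    env = {**os.environ, **ENV_OVERRIDES}
    return argv, env


# Placeholder values, not the real paths or addresses.
argv, env = build_launch("/models/PLACEHOLDER", node_rank=0, head_addr="10.0.0.1:5000")
print(shlex.join(argv))
```

Keeping the flags in one list means every node launches with identical settings and only `--node-rank` differs.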
Startup
- NCCL init (8-node ring+tree): ~5 s
- Weight load (113 safetensors shards): ~9 min
- KV cache: 17.6M tokens (50.4 GB cluster-wide), Mamba cache: 96 slots
- Total time to READY: ~10 min
n=1 baseline (no MTP, prefix cache enabled)
| Size | TTFT p50 | TPOT p50 | Decode p50 |
|---|---|---|---|
| 8K | 5.5 s | 73.9 ms | 13.5 tps |
| 16K | 6.1 s | 74.4 ms | 13.5 tps |
| 32K | 6.5 s | 74.3 ms | 13.5 tps |
| 64K | 6.2 s | 75.0 ms | 13.3 tps |
| 128K | 4.8 s | 76.0 ms | 13.2 tps |
⚠ Note on the flat TTFT: this is not true cold-prefill performance. The benchmark slices prompts of different sizes from the same source corpus, so larger prompts contain smaller ones as prefixes — SGLang’s radix cache hit range grows with prompt size, flattening TTFT (sometimes even making larger sizes faster).
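The effect is easy to reproduce with a toy model. The sketch below is not SGLang's radix tree: it just keeps every prompt seen so far and counts how many tokens of a new prompt fall outside the longest shared prefix. It assumes all prompts are slices from the start of the same corpus and that nothing is ever evicted; the token ids and the two-iteration loop are stand-ins.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def uncached_tokens(prompt, cache):
    """Tokens that still need prefill given previously seen prompts."""
    best = max((shared_prefix_len(prompt, seen) for seen in cache), default=0)
    return len(prompt) - best


corpus = list(range(128_000))  # stand-in token ids
sizes = [8_000, 16_000, 32_000, 64_000, 128_000]

cache = []
for iteration in range(2):
    for size in sizes:
        prompt = corpus[:size]
        print(iteration, size, uncached_tokens(prompt, cache))
        cache.append(prompt)
```

On the first pass each size only pays for the part beyond the previous size (8,000, 8,000, 16,000, 32,000, 64,000 tokens), and on the second pass every size is a full hit, which is why the measured TTFT does not track prompt length.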
True cold prefill (measured with radix cache disabled, see MTP section below) scales near-linearly: TTFT goes 5.8 s → 46.7 s from 8K → 64K at a stable ~1,380 tok/s prefill throughput. NemotronH’s hybrid architecture (Mamba-dominant, sparse attention) keeps this O(n)-ish rather than a Transformer’s O(n²), but prefill cost still grows with size.
Production takeaway: workloads that reuse a common system prompt (chatbots, agent loops) get seconds-range TTFT from the prefix cache; workloads with unique long prompts (RAG, batch) should budget against the 1,380 tok/s cold-prefill figure.
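For budgeting, a back-of-the-envelope estimate is just prompt length divided by that rate. The sketch below assumes "K" means 1,000 tokens and that throughput stays at ~1,380 tok/s beyond the sizes I measured, so the 128K figure is an extrapolation (that run never completed).

```python
COLD_PREFILL_TOK_PER_S = 1380.0  # cold-prefill throughput quoted above


def cold_ttft_seconds(prompt_tokens: int,
                      rate: float = COLD_PREFILL_TOK_PER_S) -> float:
    """Estimated time to first token with no prefix-cache hit."""
    return prompt_tokens / rate


for label, tokens in [("8K", 8_000), ("16K", 16_000), ("32K", 32_000),
                      ("64K", 64_000), ("128K", 128_000)]:
    print(f"{label:>4}: ~{cold_ttft_seconds(tokens):.1f} s")
```

For 8K–64K the estimates land within roughly 0.3 s of the measured cold TTFT values in the MTP table.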
MTP NEXTN (speculative decoding)
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4:
| Size | TTFT (true cold, radix off) | Decode p50 | vs baseline |
|---|---|---|---|
| 8K | 5.8 s | 29.6 tps | 2.19× |
| 16K | 11.7 s | 29.4 tps | 2.19× |
| 32K | 23.1 s | 29.0 tps | 2.16× |
| 64K | 46.7 s | 29.0 tps | 2.17× |
| 128K | n/a (a GX10 node crashed mid-run ⚠) | n/a | n/a |
Consistent 2.16–2.19× decode speedup, with TPOT cut by more than half (74 ms → 34 ms), across 8K–64K.
Note: in SGLang 0.5.12, NemotronH + MTP cannot coexist with the radix cache (--disable-radix-cache required), so enabling MTP means losing prefix caching. Trade-off: long-form output / RAG / agentic → MTP on; short-response multi-turn chat with a shared system prompt → MTP off + prefix cache.
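One way to put a number on that trade-off is a break-even output length: how many generated tokens it takes before MTP's faster decode makes up for the cold TTFT. The sketch below assumes a simple latency model (TTFT plus output tokens times TPOT) and plugs in the p50 values from the two tables. The baseline TTFT reflects the prefix-cache hits in my benchmark, so a workload with a different hit rate will shift the result.

```python
def break_even_output_tokens(ttft_mtp_s: float, tps_mtp: float,
                             ttft_cached_s: float, tps_base: float) -> float:
    """Output length above which MTP (no prefix cache) has lower total latency.

    Model: latency = TTFT + n_out * TPOT, with TPOT = 1 / decode tps.
    Requires tps_mtp > tps_base.
    """
    extra_wait = ttft_mtp_s - ttft_cached_s
    if extra_wait <= 0:
        return 0.0  # MTP is already ahead at the first token
    saved_per_token = 1.0 / tps_base - 1.0 / tps_mtp
    return extra_wait / saved_per_token


# (MTP cold TTFT, MTP tps, baseline cached TTFT, baseline tps) from the tables
be_8k = break_even_output_tokens(5.8, 29.6, 5.5, 13.5)
be_64k = break_even_output_tokens(46.7, 29.0, 6.2, 13.3)
print(f"8K prompt:  MTP wins beyond ~{be_8k:.0f} output tokens")
print(f"64K prompt: MTP wins beyond ~{be_64k:.0f} output tokens")
```

Under these assumptions MTP pays off almost immediately at 8K, but at 64K it needs on the order of a thousand output tokens, which lines up with the "long-form output → MTP on" rule of thumb.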
Getting a 550B model to ~30 tps per stream with a 256K context window on an 8-node cluster bought on an individual's budget exceeded my expectations.
Bonus: GX10 stability tips wanted
Quick ask: my 4 ASUS Ascent GX10 units frequently go down under sustained inference load (e.g., the 128K cold-prefill runs above). The 4 Lenovo ThinkStation PGX units — same GB10 SoC — run the identical workload without issues, so I suspect something GX10-specific.
Symptoms:
- Silent failure during inference: kernel ring buffer freezes → user space dies ~13 minutes later
- Eventually progresses to a full power-off state (no ICMP/ARP; physical power-on required)
- Neighboring nodes’ journalctl/dmesg show zero events related to the failed node
If anyone has tips for keeping the GX10 stable under this kind of workload — BIOS settings, power management, thermal control, driver options, anything — I’d really appreciate it.
Happy to share full recipes, NCCL flags, SGLang launch scripts, and breakout wiring diagrams if anyone wants to replicate this setup.