8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8

8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8

Sharing real-world results from running 8 GB10 nodes (4x ASUS Ascent GX10 + 4x Lenovo ThinkStation PGX) on a single MikroTik CRS812.

  • ✅ A single CRS812-8DS-2DQ-2DDQ with two 400DD→4x100G breakout cables can host an 8-node DGX Spark cluster
  • ✅ 200G vs 100G: per-stream decode speed is essentially unchanged (TTFT increases slightly when warm, noticeably on cold prefill)
  • 🚀 Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8 was faster than expected
  • Bonus question at the end: looking for GX10 stability tips

Hardware

  • Compute: 8x GB10 (SoC sm_121a, 128GB UMA each, ~1TB total)
    • ASUS Ascent GX10 × 4
    • Lenovo ThinkStation PGX × 4
  • Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.23)
  • Cabling: 2x 400DD→4x100G breakout cables (switch side: 1x 400G QSFP-DD → node side: 4x 100G QSFP28). Two cables × 4 nodes = all 8 nodes at 100G from just the two QSFP-DD ports
  • Driver: NVIDIA 580.159 on all 8 nodes (apt-mark hold)
  • OS: Ubuntu 24.04 LTS / DGX OS, kernel 6.17.0-1021-nvidia

The CRS812 only has two 400G QSFP-DD ports, but the breakout approach lets one switch absorb the entire 8-node cluster at 100G per node.

Network: 100G vs 200G (Measured)

Same TP=4 inference workload (Qwen3.5 397B-A17B int4-AutoRound, vLLM 0.22 + no-ray + 30 GiB KV) measured under both link configurations.

Benchmark: 5 iterations × 4 prompt sizes {8K, 16K, 64K, 128K} × n=4 concurrency, max_tokens=500, thinking off, direct endpoint (no proxy).

Per-stream decode tps (single stream)

Size All 200G All 100G Δ
8K 25.21 24.78 -1.7%
16K 25.78 25.48 -1.2%
64K 25.08 24.64 -1.8%
128K 23.48 24.20 +3.1% (noise)

Per-stream decode is essentially independent of link bandwidth (±3%). Qwen3.5-397B INT4 TP=4 decode is LPDDR5X UMA memory-bandwidth bound; NCCL all-reduce link bandwidth is not the bottleneck.

Aggregate throughput (n=4 concurrent, warm)

Size All 200G All 100G Δ
8K 68.7 tps 53.6 tps -22.0%
16K 78.6 tps 64.1 tps -18.5%
64K 77.0 tps 73.1 tps -5.0%
128K 80.8 tps 80.6 tps -0.2%

Aggregate throughput drops ~20% at short contexts (8K–16K), but the gap nearly disappears at 64K–128K.

TTFT (warm, prefix-cache hit)

Size All 200G All 100G Δ
8K 2.02s 4.15s +106%
16K 1.69s 3.12s +85%
64K 1.03s 1.60s +56%
128K 0.86s 1.11s +29%

The TTFT multiplier is larger at short contexts, but warm TTFT stays in the seconds-to-seconds range either way — barely perceptible.

Conclusion

From a production inference standpoint: decode tps is link-independent (memory bound), warm TTFT differences are small in absolute terms, and only cold prefill on large contexts is significantly affected. Collapsing all 8 nodes onto one switch at 100G is an acceptable trade-off for production.

⚠ Caveat: early in the project I hit a ConnectX-7 PCIe Power Throttle stuck state — after a cable hot-swap, inter-node bandwidth got stuck at ~13 Gbit/s until a host reboot. Worth checking if your inter-node bandwidth looks wrong.

Inference: Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8

Engine: scitrera/dgx-spark-sglang:0.5.12, key settings:

--tp-size 8 --pp-size 1
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.85
--attention-backend flashinfer
--moe-runner-backend flashinfer_cutlass
--max-mamba-cache-size 96
--cuda-graph-max-bs 8
--disable-piecewise-cuda-graph

NCCL_IB_HCA=rocep1s0f1   # specifying both RoCE lanes fails at startup (ibv_modify_qp error)
SGLANG_ENABLE_DEEP_GEMM=0   # --disable-deep-gemm flag not implemented in 0.5.12

Startup

  • NCCL init (8-node ring+tree): ~5 s
  • Weight load (113 safetensors shards): ~9 min
  • KV cache: 17.6M tokens (50.4 GB cluster-wide), Mamba cache: 96 slots
  • Total time to READY: ~10 min

n=1 baseline (no MTP, prefix cache enabled)

Size TTFT p50 TPOT p50 Decode p50
8K 5.5 s 73.9 ms 13.5 tps
16K 6.1 s 74.4 ms 13.5 tps
32K 6.5 s 74.3 ms 13.5 tps
64K 6.2 s 75.0 ms 13.3 tps
128K 4.8 s 76.0 ms 13.2 tps

⚠ Note on the flat TTFT: this is not true cold-prefill performance. The benchmark slices prompts of different sizes from the same source corpus, so larger prompts contain smaller ones as prefixes — SGLang’s radix cache hit range grows with prompt size, flattening TTFT (sometimes even making larger sizes faster).

True cold prefill (measured with radix cache disabled, see MTP section below) scales near-linearly: TTFT goes 5.7s → 46.7s from 8K → 64K at a stable ~1,380 tok/s prefill throughput. NemotronH’s hybrid architecture (Mamba-dominant, sparse attention) keeps this O(n)-ish rather than a Transformer’s O(n²), but prefill cost still grows with size.

Production takeaway: workloads that reuse a common system prompt (chatbots, agent loops) get seconds-range TTFT from the prefix cache; workloads with unique long prompts (RAG, batch) should budget against the 1,380 tok/s cold-prefill figure.

MTP NEXTN (speculative decoding)

--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4:

Size TTFT (true cold, radix off) Decode p50 vs baseline
8K 5.8 s 29.6 tps 2.19×
16K 11.7 s 29.4 tps 2.19×
32K 23.1 s 29.0 tps 2.16×
64K 46.7 s 29.0 tps 2.17×
128K n/a — a GX10 node crashed mid-run ⚠

Consistent 2.16–2.19× decode speedup with TPOT cut in half (74 ms → 34 ms) across 8K–64K.

Note: in SGLang 0.5.12, NemotronH + MTP cannot coexist with the radix cache (--disable-radix-cache required), so enabling MTP means losing prefix caching. Trade-off: long-form output / RAG / agentic → MTP on; short-response multi-turn chat with a shared system prompt → MTP off + prefix cache.

Getting a 550B model to ~30 tps per-stream with a 256K context window on an 8-node, individual-budget setup exceeded my expectations.

Bonus: GX10 stability tips wanted

Quick ask: my 4 ASUS Ascent GX10 units frequently go down under sustained inference load (e.g., the 128K cold-prefill runs above). The 4 Lenovo ThinkStation PGX units — same GB10 SoC — run the identical workload without issues, so I suspect something GX10-specific.

Symptoms:

  • Silent failure during inference: kernel ring buffer freezes → user space dies ~13 minutes later
  • Eventually progresses to a full power-off state (no ICMP/ARP; physical power-on required)
  • Neighboring nodes’ journalctl/dmesg show zero events related to the failed node

If anyone has tips for keeping the GX10 stable under this kind of workload — BIOS settings, power management, thermal control, driver options, anything — I’d really appreciate it.

Happy to share full recipes, NCCL flags, SGLang launch scripts, and breakout wiring diagrams if anyone wants to replicate this setup.

Gx10 has some heat issues. I printed a fan box and it’s stable. Thank you for the numbers can you test GLM 5.1?

sudo nvidia-smi -lgc 0,2000

Although it might seem excessive to limit performance on each node, for many, this change isn’t as impactful as this limit PP mostly, the memory is always the real bottleneck. Anyone doing inference on Nvidia gear is always turning down the power for less heat, less trouble.

You should also look up Henry’s “Spice Harvester” 3d-printed cooling setup.

I’ll add, given that I have the same switch, the darn Connectx-7 is a MAJOR contributor to heat, even though we only get about 100Gpbs out of it.