Nemotron-3-Ultra-550B-A55B-NVFP4 on 4× DGX Spark via SGLang (TP=4 EP=4, RoCE) — it works, ~42–43 tok/s n8 peak

ht12 · June 9, 2026, 8:24am

There are a few open threads here (NVFP4 on 4 Sparks, capacity planning on GB10) where people are trying to get the NVFP4 Ultra running with SGLang/vLLM on a 4-node Spark cluster and reporting “no joy so far.” The only confirmed Ultra success so far has been the 2-bit GGUF / llama.cpp-RPC route on 2 Sparks (~5 tok/s).

We got the native NVFP4 weights serving on SGLang across 4 DGX Sparks and want to share the exact recipe, because two non-obvious traps will hard-crash you at startup otherwise.

TL;DR: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 boots and serves on the mainline upstream image scitrera/dgx-spark-sglang:0.5.12 (no custom build needed) at TP=4, EP=4 over RoCE. 8/8 concurrent requests served clean, output coherent, ~5.3 tok/s/request, ~42–43 tok/s n8 peak, 512K context essentially free.

Full test log (all 7 matrix cases, crash traces, memory breakdown): TESTLOG on GitHub

Hardware / software

Component	Value
Nodes	4× DGX Spark (GB10, SM121/Blackwell), 128 GB unified mem each, 1 GPU per node
Topology	spark1 = head, spark2–4 = workers
Driver	580.159
Kernel	6.17.0-1018-nvidia
OS	Ubuntu 24.04 LTS (aarch64)
Interconnect	200GbE ConnectX-7, RoCE over SR-IOV VF (NCCL transport `roce`)
Image	`scitrera/dgx-spark-sglang:0.5.12` — upstream base, no custom kernel build
Orchestration	K3s v1.35.3, deployed via Ansible (links below)

The model is NemotronHForCausalLM (model_type=nemotron_h) — a Mamba2 + MoE + attention hybrid: 108 layers (48 mamba / 48 moe / 12 attention), 550B total / 55B active LatentMoE, 512 routed + 1 shared experts, NoPE (no RoPE — Mamba2 carries order), native ctx cap 262144. Quant is modelopt_mixed (FP4 expert FFN @ group_size 16, FP8/BF16 for attention/latent/embeddings).

The recipe

Parallelism: TP=4, PP=1, EP=4 (expert-parallel; the 512 experts shard 128/GPU). Weights land at 83.7 GB/GPU (note: this is less than the ~107 GB/GPU you’d naively estimate — the mixed-precision attn/latent/embedding tensors are smaller than the FP4 experts), load ~490 s.

Validated SGLang launch knobs (head shown; workers identical bar --node-rank):

python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --tp-size 4 --pp-size 1 --ep-size 4 \
  --nnodes 4 --node-rank 0 --dist-init-addr <head-ip>:<port> \
  --context-length 262144 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.9 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --max-mamba-cache-size 48 \
  --cuda-graph-max-bs 8 \
  --disable-piecewise-cuda-graph \
  --disable-deep-gemm \
  --reasoning-parser nemotron_3 \
  --tool-call-parser qwen3_coder

Key knobs and why:

Knob	Value	Why
`mem-fraction-static`	0.9	This is a post-weight reserve knob (higher = more KV), not a vLLM-style fraction-of-total. EP=4 floor is 0.88 — 0.85 startup-crashes (KV pool goes negative). 0.88/0.90/0.92 all serve with flat throughput.
`moe-runner-backend`	flashinfer_cutlass	The only viable MoE runner. `triton` is silently ignored on the NVFP4 modelopt path and crashes in `cutlass_moe_fp4` during CG capture (see trap #2).
`attention-backend`	flashinfer	`triton` is hard-asserted off on NemotronH (the hybrid starts with a Mamba layer, not attention).
`max-mamba-cache-size`	48	On this hybrid the Mamba state pool — not KV — is the concurrency ceiling: `max_running_requests = mamba_cache // 3`. 48 → 16 parallel. Default auto-fit gives only ~6.
`kv-cache-dtype`	fp8_e4m3	KV is wildly over-provisioned here anyway (NoPE + 96/108 non-attn layers).
`disable-piecewise-cuda-graph`	true	Hybrid graph doesn’t piecewise-capture cleanly; set by the model card.
`disable-deep-gemm`	true	DeepGemm targets FP8/ue8m0 scale format; on NVFP4 it JITs forever and exhausts host RAM.
`disable-cuda-graph`	false	Full CUDA graph. Eager mode is broken on the cutlass FP4 MoE path.

Reasoning is on by default in the chat template and toggled per request via extra_body={"chat_template_kwargs":{"enable_thinking":false}} (or {"low_effort":true}) — not a launch flag. SGLang’s reasoning parser is nemotron_3 (vLLM/TRT-LLM use the super_v3/ultra_v3 plugin instead — don’t copy those here).

Results (peak = Σ per-request tok/s, n8 = 8 concurrent)

mfs	ctx	fp4_gemm	moe	Status	n1	n4	n8 peak	n8 ok	n8 ttft
0.85	262k	fi_cutlass	fi_cutlass	CRASH (mfs too low)	—	—	—	—	—
0.88	262k	fi_cutlass	fi_cutlass	ok	10.2	29.3	43.4	8/8	1.77
0.90	262k	fi_cutlass	fi_cutlass	ok	10.1	27.9	42.0	8/8	1.95
0.92	262k	fi_cutlass	fi_cutlass	ok	10.1	29.6	42.8	8/8	1.63
0.90	524k	fi_cutlass	fi_cutlass	ok	10.0	29.5	43.2	8/8	2.37
0.90	262k	fi_cudnn	fi_cutlass	ok	10.1	28.9	42.0	8/8	1.71
0.90	262k	fi_cutlass	triton	CRASH (cutlass assert)	—	—	—	—	—

Per-request decode is a flat ~5.3 tok/s across every working case (it’s a 550B/55B-active model on TP=4 over RoCE). 512K context is essentially free (43.2 vs 42.0 @ 262k — within noise, 8/8 clean): NoPE + Mamba means KV barely grows. 1M is still untested.

The two startup traps

1. mem-fraction-static floor is 0.88 under EP=4, not 0.85.

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

This fires at KV profiling, not weight load. Under EP=4 the expert-dispatch buffers shrink rest_memory; at 0.85 the held-back reserve exceeds post-weight free memory, so the KV pool goes negative. Counter-intuitive name: raise mfs to fix it (more KV), don’t lower it.

2. moe-runner-backend triton crashes — flashinfer_cutlass is the only option.

sglang/srt/layers/moe/cutlass_moe.py:427
AssertionError: mismatch in expected `n`  (nx2_w1 == intermediate_size_per_partition * 2)

ModelOptFp4 always routes the FFN through cutlass_moe_fp4; the triton runner flag is effectively ignored, and the LatentMoE/512-expert shape trips a hard cutlass assertion during CG capture. Confirmed on both the Super sibling and Ultra.

(Aside: flashinfer_cudnn for the FP4 GEMM ties flashinfer_cutlass on peak throughput here — the matrix “winner=cudnn” label is an aggregate total_tokens/wall_time artifact, not kernel speed. We keep flashinfer_cutlass: equal speed, and cuDNN had a startup-crash discrepancy on the Super sibling with the same image tag.)

Full sources (GitHub)

Everything below lives in our public Ansible + Python repo for this 4-node DGX Spark K3s cluster — github.com/vroomfondel/dgxarley:

Test log (all cases, crash traces, memory breakdown, findings): TESTLOGS/sglang_nn4_tp4_ep4/nemotron-3-ultra-550b-a55b-nvfp4/
Model profile (canonical, fully annotated launch contract — every knob explained): roles/k8s_dgx/model_profiles/nvidia-nvidia-nemotron-3-ultra-550b-a55b-nvfp4.yml
SGLang deployment role (head + workers, Multus + RoCE-over-SR-IOV, HAProxy sidecar): roles/k8s_dgx/

Happy to answer questions on the RoCE/SR-IOV setup or the Ansible side. The Super-120B sibling and several Gemma/Qwen models have their own test logs in the same TESTLOGS/ tree if you’re comparing.

Topic		Replies	Views
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 DGX Spark / GB10 nemotron	31	2073	June 10, 2026
DGX Spark, Nemotron3, and NVFP4: Getting to 65+ tps DGX Spark / GB10 spark , nemotron , dgx	14	2215	December 22, 2025
Nemotron-3-Super-120B-A12B-NVFP4 + MTP on 4× DGX Spark via SGLang (TP=4, RoCE) - MTP actually pays off: 1.70× single-stream, accept-len ≈ 2.7 DGX Spark / GB10 Projects cudnn , nemotron	5	188	June 18, 2026
Nemotron-3-Super-120B-A12B-NVFP4 on single DGX Spark: 23.45 tok/s (spark-arena.com/ benhmarks) DGX Spark / GB10 cuda , benchmarks , spark , llm , nemotron , dgx , nemoclaw	6	913	May 26, 2026
NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 DGX Spark / GB10 nemotron	89	10049	March 31, 2026
Help running Nemotron 3 Nano 30B-A3B-FP8 on DGX Spark (GB10) DGX Spark / GB10 spark , nim , nemotron	41	3436	January 24, 2026
Running nvidia/nemotron-3-super on DGX spark DGX Spark / GB10 nemotron	12	1917	March 26, 2026
Nemotron-3-Ultra-550B-A55B (2-bit GGUF) across 2× DGX Spark via llama.cpp RPC — it works (~5 tok/s) DGX Spark / GB10 llama , nemotron	7	703	June 7, 2026
Two multi-node DGX Spark wins: RoCE 2× inference throughput + Qwen3.5-397B-A17B-NVFP4 serving (with SM121 CUTLASS patch) DGX Spark / GB10 Projects	4	832	April 16, 2026
Multi-node DGX Spark + SGLang win: Gemma-4-31B + MTP — +80 % @ n=8 (153 tok/s) on 4× GB10 DGX Spark / GB10 Projects	0	471	May 16, 2026