There are a few open threads here (NVFP4 on 4 Sparks, capacity planning on GB10) where people are trying to get the NVFP4 Ultra running with SGLang/vLLM on a 4-node Spark cluster and reporting “no joy so far.” The only confirmed Ultra success so far has been the 2-bit GGUF / llama.cpp-RPC route on 2 Sparks (~5 tok/s).
We got the native NVFP4 weights serving on SGLang across 4 DGX Sparks and want to share the exact recipe, because two non-obvious traps will hard-crash you at startup otherwise.
TL;DR: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 boots and serves on the mainline upstream image scitrera/dgx-spark-sglang:0.5.12 (no custom build needed) at TP=4, EP=4 over RoCE. 8/8 concurrent requests served clean, output coherent, ~5.3 tok/s/request, ~42–43 tok/s n8 peak, 512K context essentially free.
Full test log (all 7 matrix cases, crash traces, memory breakdown): TESTLOG on GitHub
Hardware / software
| Component | Value |
|---|---|
| Nodes | 4× DGX Spark (GB10, SM121/Blackwell), 128 GB unified mem each, 1 GPU per node |
| Topology | spark1 = head, spark2–4 = workers |
| Driver | 580.159 |
| Kernel | 6.17.0-1018-nvidia |
| OS | Ubuntu 24.04 LTS (aarch64) |
| Interconnect | 200GbE ConnectX-7, RoCE over SR-IOV VF (NCCL transport roce) |
| Image | scitrera/dgx-spark-sglang:0.5.12 — upstream base, no custom kernel build |
| Orchestration | K3s v1.35.3, deployed via Ansible (links below) |
The model is NemotronHForCausalLM (model_type=nemotron_h) — a Mamba2 + MoE + attention hybrid: 108 layers (48 mamba / 48 moe / 12 attention), 550B total / 55B active LatentMoE, 512 routed + 1 shared experts, NoPE (no RoPE — Mamba2 carries order), native ctx cap 262144. Quant is modelopt_mixed (FP4 expert FFN @ group_size 16, FP8/BF16 for attention/latent/embeddings).
The recipe
Parallelism: TP=4, PP=1, EP=4 (expert-parallel; the 512 experts shard 128/GPU). Weights land at 83.7 GB/GPU (note: this is less than the ~107 GB/GPU you’d naively estimate — the mixed-precision attn/latent/embedding tensors are smaller than the FP4 experts), load ~490 s.
Validated SGLang launch knobs (head shown; workers identical bar --node-rank):
python3 -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--trust-remote-code \
--quantization modelopt_fp4 \
--tp-size 4 --pp-size 1 --ep-size 4 \
--nnodes 4 --node-rank 0 --dist-init-addr <head-ip>:<port> \
--context-length 262144 \
--kv-cache-dtype fp8_e4m3 \
--mem-fraction-static 0.9 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--max-mamba-cache-size 48 \
--cuda-graph-max-bs 8 \
--disable-piecewise-cuda-graph \
--disable-deep-gemm \
--reasoning-parser nemotron_3 \
--tool-call-parser qwen3_coder
Key knobs and why:
| Knob | Value | Why |
|---|---|---|
mem-fraction-static |
0.9 | This is a post-weight reserve knob (higher = more KV), not a vLLM-style fraction-of-total. EP=4 floor is 0.88 — 0.85 startup-crashes (KV pool goes negative). 0.88/0.90/0.92 all serve with flat throughput. |
moe-runner-backend |
flashinfer_cutlass | The only viable MoE runner. triton is silently ignored on the NVFP4 modelopt path and crashes in cutlass_moe_fp4 during CG capture (see trap #2). |
attention-backend |
flashinfer | triton is hard-asserted off on NemotronH (the hybrid starts with a Mamba layer, not attention). |
max-mamba-cache-size |
48 | On this hybrid the Mamba state pool — not KV — is the concurrency ceiling: max_running_requests = mamba_cache // 3. 48 → 16 parallel. Default auto-fit gives only ~6. |
kv-cache-dtype |
fp8_e4m3 | KV is wildly over-provisioned here anyway (NoPE + 96/108 non-attn layers). |
disable-piecewise-cuda-graph |
true | Hybrid graph doesn’t piecewise-capture cleanly; set by the model card. |
disable-deep-gemm |
true | DeepGemm targets FP8/ue8m0 scale format; on NVFP4 it JITs forever and exhausts host RAM. |
disable-cuda-graph |
false | Full CUDA graph. Eager mode is broken on the cutlass FP4 MoE path. |
Reasoning is on by default in the chat template and toggled per request via extra_body={"chat_template_kwargs":{"enable_thinking":false}} (or {"low_effort":true}) — not a launch flag. SGLang’s reasoning parser is nemotron_3 (vLLM/TRT-LLM use the super_v3/ultra_v3 plugin instead — don’t copy those here).
Results (peak = Σ per-request tok/s, n8 = 8 concurrent)
| mfs | ctx | fp4_gemm | moe | Status | n1 | n4 | n8 peak | n8 ok | n8 ttft |
|---|---|---|---|---|---|---|---|---|---|
| 0.85 | 262k | fi_cutlass | fi_cutlass | CRASH (mfs too low) | — | — | — | — | — |
| 0.88 | 262k | fi_cutlass | fi_cutlass | ok | 10.2 | 29.3 | 43.4 | 8/8 | 1.77 |
| 0.90 | 262k | fi_cutlass | fi_cutlass | ok | 10.1 | 27.9 | 42.0 | 8/8 | 1.95 |
| 0.92 | 262k | fi_cutlass | fi_cutlass | ok | 10.1 | 29.6 | 42.8 | 8/8 | 1.63 |
| 0.90 | 524k | fi_cutlass | fi_cutlass | ok | 10.0 | 29.5 | 43.2 | 8/8 | 2.37 |
| 0.90 | 262k | fi_cudnn | fi_cutlass | ok | 10.1 | 28.9 | 42.0 | 8/8 | 1.71 |
| 0.90 | 262k | fi_cutlass | triton | CRASH (cutlass assert) | — | — | — | — | — |
Per-request decode is a flat ~5.3 tok/s across every working case (it’s a 550B/55B-active model on TP=4 over RoCE). 512K context is essentially free (43.2 vs 42.0 @ 262k — within noise, 8/8 clean): NoPE + Mamba means KV barely grows. 1M is still untested.
The two startup traps
1. mem-fraction-static floor is 0.88 under EP=4, not 0.85.
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
This fires at KV profiling, not weight load. Under EP=4 the expert-dispatch buffers shrink rest_memory; at 0.85 the held-back reserve exceeds post-weight free memory, so the KV pool goes negative. Counter-intuitive name: raise mfs to fix it (more KV), don’t lower it.
2. moe-runner-backend triton crashes — flashinfer_cutlass is the only option.
sglang/srt/layers/moe/cutlass_moe.py:427
AssertionError: mismatch in expected `n` (nx2_w1 == intermediate_size_per_partition * 2)
ModelOptFp4 always routes the FFN through cutlass_moe_fp4; the triton runner flag is effectively ignored, and the LatentMoE/512-expert shape trips a hard cutlass assertion during CG capture. Confirmed on both the Super sibling and Ultra.
(Aside: flashinfer_cudnn for the FP4 GEMM ties flashinfer_cutlass on peak throughput here — the matrix “winner=cudnn” label is an aggregate total_tokens/wall_time artifact, not kernel speed. We keep flashinfer_cutlass: equal speed, and cuDNN had a startup-crash discrepancy on the Super sibling with the same image tag.)
Full sources (GitHub)
Everything below lives in our public Ansible + Python repo for this 4-node DGX Spark K3s cluster — github.com/vroomfondel/dgxarley:
- Test log (all cases, crash traces, memory breakdown, findings):
TESTLOGS/sglang_nn4_tp4_ep4/nemotron-3-ultra-550b-a55b-nvfp4/ - Model profile (canonical, fully annotated launch contract — every knob explained):
roles/k8s_dgx/model_profiles/nvidia-nvidia-nemotron-3-ultra-550b-a55b-nvfp4.yml - SGLang deployment role (head + workers, Multus + RoCE-over-SR-IOV, HAProxy sidecar):
roles/k8s_dgx/
Happy to answer questions on the RoCE/SR-IOV setup or the Ansible side. The Super-120B sibling and several Gemma/Qwen models have their own test logs in the same TESTLOGS/ tree if you’re comparing.