Nemotron-3-Ultra-550B-A55B-NVFP4 on 4× DGX Spark via SGLang (TP=4 EP=4, RoCE) — it works, ~42–43 tok/s n8 peak

There are a few open threads here (NVFP4 on 4 Sparks, capacity planning on GB10) where people are trying to get the NVFP4 Ultra running with SGLang/vLLM on a 4-node Spark cluster and reporting “no joy so far.” The only confirmed Ultra success so far has been the 2-bit GGUF / llama.cpp-RPC route on 2 Sparks (~5 tok/s).

We got the native NVFP4 weights serving on SGLang across 4 DGX Sparks and want to share the exact recipe, because two non-obvious traps will hard-crash you at startup otherwise.

TL;DR: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 boots and serves on the mainline upstream image scitrera/dgx-spark-sglang:0.5.12 (no custom build needed) at TP=4, EP=4 over RoCE. 8/8 concurrent requests served clean, output coherent, ~5.3 tok/s/request, ~42–43 tok/s n8 peak, 512K context essentially free.

Full test log (all 7 matrix cases, crash traces, memory breakdown): TESTLOG on GitHub


Hardware / software

Component Value
Nodes 4× DGX Spark (GB10, SM121/Blackwell), 128 GB unified mem each, 1 GPU per node
Topology spark1 = head, spark2–4 = workers
Driver 580.159
Kernel 6.17.0-1018-nvidia
OS Ubuntu 24.04 LTS (aarch64)
Interconnect 200GbE ConnectX-7, RoCE over SR-IOV VF (NCCL transport roce)
Image scitrera/dgx-spark-sglang:0.5.12upstream base, no custom kernel build
Orchestration K3s v1.35.3, deployed via Ansible (links below)

The model is NemotronHForCausalLM (model_type=nemotron_h) — a Mamba2 + MoE + attention hybrid: 108 layers (48 mamba / 48 moe / 12 attention), 550B total / 55B active LatentMoE, 512 routed + 1 shared experts, NoPE (no RoPE — Mamba2 carries order), native ctx cap 262144. Quant is modelopt_mixed (FP4 expert FFN @ group_size 16, FP8/BF16 for attention/latent/embeddings).


The recipe

Parallelism: TP=4, PP=1, EP=4 (expert-parallel; the 512 experts shard 128/GPU). Weights land at 83.7 GB/GPU (note: this is less than the ~107 GB/GPU you’d naively estimate — the mixed-precision attn/latent/embedding tensors are smaller than the FP4 experts), load ~490 s.

Validated SGLang launch knobs (head shown; workers identical bar --node-rank):

python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --tp-size 4 --pp-size 1 --ep-size 4 \
  --nnodes 4 --node-rank 0 --dist-init-addr <head-ip>:<port> \
  --context-length 262144 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.9 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --max-mamba-cache-size 48 \
  --cuda-graph-max-bs 8 \
  --disable-piecewise-cuda-graph \
  --disable-deep-gemm \
  --reasoning-parser nemotron_3 \
  --tool-call-parser qwen3_coder

Key knobs and why:

Knob Value Why
mem-fraction-static 0.9 This is a post-weight reserve knob (higher = more KV), not a vLLM-style fraction-of-total. EP=4 floor is 0.88 — 0.85 startup-crashes (KV pool goes negative). 0.88/0.90/0.92 all serve with flat throughput.
moe-runner-backend flashinfer_cutlass The only viable MoE runner. triton is silently ignored on the NVFP4 modelopt path and crashes in cutlass_moe_fp4 during CG capture (see trap #2).
attention-backend flashinfer triton is hard-asserted off on NemotronH (the hybrid starts with a Mamba layer, not attention).
max-mamba-cache-size 48 On this hybrid the Mamba state pool — not KV — is the concurrency ceiling: max_running_requests = mamba_cache // 3. 48 → 16 parallel. Default auto-fit gives only ~6.
kv-cache-dtype fp8_e4m3 KV is wildly over-provisioned here anyway (NoPE + 96/108 non-attn layers).
disable-piecewise-cuda-graph true Hybrid graph doesn’t piecewise-capture cleanly; set by the model card.
disable-deep-gemm true DeepGemm targets FP8/ue8m0 scale format; on NVFP4 it JITs forever and exhausts host RAM.
disable-cuda-graph false Full CUDA graph. Eager mode is broken on the cutlass FP4 MoE path.

Reasoning is on by default in the chat template and toggled per request via extra_body={"chat_template_kwargs":{"enable_thinking":false}} (or {"low_effort":true}) — not a launch flag. SGLang’s reasoning parser is nemotron_3 (vLLM/TRT-LLM use the super_v3/ultra_v3 plugin instead — don’t copy those here).


Results (peak = Σ per-request tok/s, n8 = 8 concurrent)

mfs ctx fp4_gemm moe Status n1 n4 n8 peak n8 ok n8 ttft
0.85 262k fi_cutlass fi_cutlass CRASH (mfs too low)
0.88 262k fi_cutlass fi_cutlass ok 10.2 29.3 43.4 8/8 1.77
0.90 262k fi_cutlass fi_cutlass ok 10.1 27.9 42.0 8/8 1.95
0.92 262k fi_cutlass fi_cutlass ok 10.1 29.6 42.8 8/8 1.63
0.90 524k fi_cutlass fi_cutlass ok 10.0 29.5 43.2 8/8 2.37
0.90 262k fi_cudnn fi_cutlass ok 10.1 28.9 42.0 8/8 1.71
0.90 262k fi_cutlass triton CRASH (cutlass assert)

Per-request decode is a flat ~5.3 tok/s across every working case (it’s a 550B/55B-active model on TP=4 over RoCE). 512K context is essentially free (43.2 vs 42.0 @ 262k — within noise, 8/8 clean): NoPE + Mamba means KV barely grows. 1M is still untested.


The two startup traps

1. mem-fraction-static floor is 0.88 under EP=4, not 0.85.

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

This fires at KV profiling, not weight load. Under EP=4 the expert-dispatch buffers shrink rest_memory; at 0.85 the held-back reserve exceeds post-weight free memory, so the KV pool goes negative. Counter-intuitive name: raise mfs to fix it (more KV), don’t lower it.

2. moe-runner-backend triton crashes — flashinfer_cutlass is the only option.

sglang/srt/layers/moe/cutlass_moe.py:427
AssertionError: mismatch in expected `n`  (nx2_w1 == intermediate_size_per_partition * 2)

ModelOptFp4 always routes the FFN through cutlass_moe_fp4; the triton runner flag is effectively ignored, and the LatentMoE/512-expert shape trips a hard cutlass assertion during CG capture. Confirmed on both the Super sibling and Ultra.

(Aside: flashinfer_cudnn for the FP4 GEMM ties flashinfer_cutlass on peak throughput here — the matrix “winner=cudnn” label is an aggregate total_tokens/wall_time artifact, not kernel speed. We keep flashinfer_cutlass: equal speed, and cuDNN had a startup-crash discrepancy on the Super sibling with the same image tag.)


Full sources (GitHub)

Everything below lives in our public Ansible + Python repo for this 4-node DGX Spark K3s cluster — github.com/vroomfondel/dgxarley:

Happy to answer questions on the RoCE/SR-IOV setup or the Ansible side. The Super-120B sibling and several Gemma/Qwen models have their own test logs in the same TESTLOGS/ tree if you’re comparing.

6 Likes