TL;DR — Multi-Token Prediction (MTP / EAGLE) on the NemotronH-NVFP4 Super finally
works on DGX Spark (SM121), with a real speedup, on a current SGLang dev build.
This is the gap the existing threads leave open: the published state for Super-120B + MTP
on Spark is “crashes” (vLLM, t/366660)
or “0% draft acceptance, accept_len = 1.00” (SGLang, sglang#21138).
On a build carrying the June-2026 NemotronH-MTP fixes we get accept_len ≈ 2.7,
1.70× single-stream and 1.37× at 8-way concurrency over the no-spec baseline. Notably,
the 3-step / 4-draft depth beats NVIDIA’s own cookbook 5/5 recipe.

This is the sibling of the earlier Nemotron-3-Ultra-550B post:
same 4× DGX Spark cluster, same RoCE setup, this time the 120B Super with MTP on.
1. Hardware / software
| Component | Value |
|---|---|
| Nodes | 4× DGX Spark (ASUS Ascent GX10), GB10 / SM121 Blackwell, 128 GB each, 1 GPU/node |
| Topology | 1 head + 3 workers, orchestrated on K3s; control-plane on a separate x86 box (no GPU) |
| Driver | 580.159.03 |
| CUDA | 13.0 (host toolkit 13.0.3) |
| Kernel / OS | 6.17.0-1021-nvidia, Ubuntu 24.04.4 LTS (aarch64) |
| Interconnect | QSFP RoCE over ConnectX-7 SR-IOV VFs, MTU 9000 (NCCL transport = RoCE) |
| NCCL | 2.30.4 |
| SGLang | 0.5.13-dev (image: xomoxcc/dgx-spark-sglang:0.5.13-dev-nemotronh-mtp-sm121) |
| Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronHForCausalLM, Mamba2 + MoE + attn hybrid; ~67 GB NVFP4 weights, mixed precision) |
Image caveat (important for reproduction): MTP on the NemotronH-NVFP4 path needs a build
that carries the June-2026 fixes. The mainline upstream scitrera/dgx-spark-sglang:0.5.12
image used for the Ultra post does not: it boots the model fine, but MTP either no-ops
(accept_len ≈ 1) or isn’t wired. See §7 for the exact PRs this build includes.
2. The recipe (winner)
--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--trust-remote-code
--quantization modelopt_fp4
--tp-size 4 --pp-size 1 # 4 nodes, single TP group, no pipeline stages
--nnodes 4
--context-length 524288 # + json-model-override max_position_embeddings=524288
--mem-fraction-static 0.80
--attention-backend flashinfer # triton attn is hard-asserted off on NemotronH
--moe-runner-backend flashinfer_cutlass
--fp4-gemm-backend flashinfer_cutlass
--kv-cache-dtype fp8_e4m3
--reasoning-parser nemotron_3 --tool-call-parser qwen3_coder
--disable-piecewise-cuda-graph # full CUDA-graph, piecewise off (per the model card)
# ---- MTP / speculative decoding (built-in MTP layer, EAGLE-style) ----
--speculative-algorithm EAGLE
--speculative-num-steps 3 # <-- WINNER. NOT the cookbook's 5.
--speculative-num-draft-tokens 4 # <-- WINNER. NOT the cookbook's 5.
--speculative-eagle-topk 1
No --disable-radix-cache is needed on this build (it was a requirement on older nightlies;
sglang#27998 removed it).
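
A minimal sketch of how the flags above could be collected into one argv list, with the draft depth as the only tunable. Assumptions that go beyond the recipe as listed: the `python3 -m sglang.launch_server` entry point, and passing the `max_position_embeddings` override through `--json-model-override-args` as a JSON string. The per-node rendezvous flags are left out because the post does not list them. `build_server_args` is a hypothetical helper name, not something from the repo.

```python
import json
import shlex

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
CONTEXT_LENGTH = 524288


def build_server_args(num_steps: int = 3, num_draft_tokens: int = 4) -> list[str]:
    """Return the serving flags from the recipe as an argv-style list."""
    # JSON override that lifts the config's position cap to the context length.
    override = json.dumps({"max_position_embeddings": CONTEXT_LENGTH})
    return [
        "--model-path", MODEL,
        "--trust-remote-code",
        "--quantization", "modelopt_fp4",
        "--tp-size", "4", "--pp-size", "1",
        "--nnodes", "4",
        "--context-length", str(CONTEXT_LENGTH),
        "--json-model-override-args", override,
        "--mem-fraction-static", "0.80",
        "--attention-backend", "flashinfer",
        "--moe-runner-backend", "flashinfer_cutlass",
        "--fp4-gemm-backend", "flashinfer_cutlass",
        "--kv-cache-dtype", "fp8_e4m3",
        "--reasoning-parser", "nemotron_3",
        "--tool-call-parser", "qwen3_coder",
        "--disable-piecewise-cuda-graph",
        # MTP / speculative decoding: 3 steps, 4 draft tokens by default.
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(num_steps),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
        "--speculative-eagle-topk", "1",
    ]


args = build_server_args()
# Assumed entry point; multi-node rendezvous flags are deliberately omitted.
print(shlex.join(["python3", "-m", "sglang.launch_server", *args]))
```

Calling `build_server_args(5, 5)` yields the cookbook depth for an A/B run with everything else held identical.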
3. Results
Methodology: each request generates up to 3072 tokens; n=N means N concurrent requests.
“peak” = Σ of per-request tok/s over the successful requests (i.e. the steady-state
aggregate decode rate), not total_tokens / wall_time. “ok” = successful / failed counts
(a “failed” request here is one the harness’s repetition detector flagged, not a server error).
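
To make the difference between the two aggregates concrete, here is a small sketch. The `RequestResult` type and the sample numbers are illustrative only; they are not the harness's data structures and not measurements from these runs.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    tokens: int      # tokens generated by this request
    seconds: float   # time this request spent decoding
    ok: bool         # False if the repetition detector flagged it


def peak_tok_s(results: list[RequestResult]) -> float:
    """Sum of per-request tok/s over the successful requests ("peak")."""
    return sum(r.tokens / r.seconds for r in results if r.ok)


def wall_clock_tok_s(results: list[RequestResult], wall_time: float) -> float:
    """The alternative metric: total tokens divided by overall wall time."""
    return sum(r.tokens for r in results) / wall_time


# Made-up example: the third request stops early, so the batch runs
# under-filled for the second half of the wall-clock window.
results = [
    RequestResult(tokens=3072, seconds=120.0, ok=True),
    RequestResult(tokens=3072, seconds=128.0, ok=True),
    RequestResult(tokens=1536, seconds=60.0, ok=True),
]
print(f"peak       = {peak_tok_s(results):.1f} tok/s")
print(f"wall-clock = {wall_clock_tok_s(results, wall_time=128.0):.1f} tok/s")
```

The peak figure stays at the rate the server sustains while all requests are active, whereas the wall-clock figure drops as soon as short requests finish early.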
TP=4, EP=1 (tensor-parallel MoE) — the headline run
| Config | n=1 tok/s | n=4 peak | n=8 peak | n=8 ok | accept_len | NaN? |
|---|---|---|---|---|---|---|
| no-spec (baseline) | 31.67 | 95.0 | 146.3 | 8/0 | — | — |
| MTP 3 / 4 (winner) | 53.99 | 136.8 | 199.7 | 8/0 | ≈ 2.7 | no |
| MTP 5 / 5 (NVIDIA cookbook) | 51.5 | 124.0 | 152.7 | 7/1 | ≈ 3.0 | no |
| MTP 5 / 7 | 53.21 | 120.4 | 175.1 | 8/0 | ≈ 2.9 | no |
Speedup of the 3/4 winner vs no-spec: 1.70× single-stream, 1.37× at n=8. Clean 8/8,
coherent output, zero NaN.
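
The speedup figures follow directly from the table. This short sketch recomputes them for every configuration, using only the numbers above:

```python
# tok/s (n=1) and peak tok/s (n=4, n=8) copied from the results table.
BASELINE = {"n=1": 31.67, "n=4": 95.0, "n=8": 146.3}
CONFIGS = {
    "MTP 3/4": {"n=1": 53.99, "n=4": 136.8, "n=8": 199.7},
    "MTP 5/5": {"n=1": 51.5, "n=4": 124.0, "n=8": 152.7},
    "MTP 5/7": {"n=1": 53.21, "n=4": 120.4, "n=8": 175.1},
}


def speedups(config: dict[str, float]) -> dict[str, float]:
    """Throughput of one config relative to the no-spec baseline."""
    return {load: config[load] / BASELINE[load] for load in BASELINE}


for name, config in CONFIGS.items():
    row = ", ".join(f"{load}: {s:.2f}x" for load, s in speedups(config).items())
    print(f"{name}: {row}")
```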
Why 3/4 and not the cookbook 5/5
This is the surprising part. NVIDIA’s Advanced Deployment Guide
recommends EAGLE steps=5, draft=5. On this model + image, 5/5 loses to 3/4:
- 5/5 has the higher accept_len (≈ 3.0 vs ≈ 2.7) but lower net throughput: the extra
draft compute per step costs more than the extra accepted tokens save.
- 5/5 also tripped the repetition detector on one of the 8 concurrent requests (7/8),
collapsing its n=8 peak to 152.7, barely above the no-spec 146.3.
- 5/7 (the TRT-LLM “accept-3.45” depth) boots clean but is also slower than 3/4
(175.1 vs 199.7 at n=8).

So on DGX Spark / SM121, shorter is better: steps=3 / draft=4 is the throughput optimum
among the depths tested.
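
One rough way to see the trade-off is a back-of-the-envelope model, not a measurement. Assume single-stream throughput is accept_len tokens per draft-plus-verify cycle, and that the baseline emits one token per decode step. Then the cost of one cycle, in units of baseline steps, is accept_len divided by the measured speedup. The accept_len inputs are the approximate values from the table, so the outputs are only indicative.

```python
BASELINE_TOK_S = 31.67  # no-spec, n=1

# (accept_len, n=1 tok/s) per configuration, from the results table.
RUNS = {
    "3/4": (2.7, 53.99),
    "5/5": (3.0, 51.5),
    "5/7": (2.9, 53.21),
}


def implied_cycle_cost(accept_len: float, tok_s: float) -> float:
    """Cost of one speculative cycle, in baseline decode steps.

    Model assumption: tok_s = accept_len / cycle_time and
    BASELINE_TOK_S = 1 / step_time, so cycle_time / step_time
    = accept_len / (tok_s / BASELINE_TOK_S).
    """
    return accept_len / (tok_s / BASELINE_TOK_S)


for name, (accept_len, tok_s) in RUNS.items():
    cost = implied_cycle_cost(accept_len, tok_s)
    print(f"{name}: accept_len={accept_len}, cycle costs ~{cost:.2f} baseline steps")
```

Under this model a 5/5 cycle costs noticeably more than a 3/4 cycle while yielding only about 11 % more tokens (3.0 vs 2.7), which is consistent with 5/5 losing on net throughput.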
4. MTP findings (the point of this post)
- MTP works on NemotronH-NVFP4 / SM121. accept_len is 2.7, not the
sglang#21138 “0.33 / accept_len = 1.00” loader-bug signature. The MTP weight-loader fix
(it had been filtering lm_head.weight + backbone.embeddings out of the draft path) is in this build.
- No NaN logits. The NVFP4 MTP target-logits NaN that was chased on release/v0.5.13
(sglang#27828) does not appear here (we grepped every decode log across all MTP runs).
- 3/4 > cookbook 5/5 (see §3): re-tune the draft depth down on this hardware.
- It’s reproducible at scale: 8/8 clean at ctx 524 K with MTP buffers + KV co-resident.
5. Startup traps (same spirit as the Ultra post)
These are the non-obvious things that cause a boot-time crash or silent perf loss:
- Attention backend must be flashinfer. SGLang hard-asserts triton attention off for
NemotronH (“the first layer might not be an attention layer”; the hybrid pattern starts with
Mamba). flashinfer works (head_dim 128).
- MoE runner must be flashinfer_cutlass. triton startup-crashes with
AssertionError: mismatch in expected n in cutlass_moe_fp4. The triton flag is ignored on
the NVFP4 modelopt path, which always dispatches through cutlass_moe_fp4, and the
LatentMoE / 512-expert shape trips the assert.
- fp4 GEMM: stay on flashinfer_cutlass. flashinfer_cudnn crashes with
RuntimeError: cuDNN is not available, because the nvidia-cudnn-cu12 wheel isn’t in the image.
- Concurrency is gated by the Mamba state pool, not KV. max_running_requests is clamped to
max_mamba_cache_size // per-request-slots. With MTP this per-request reservation grows
(extra “intermediate” SSM/conv caches for draft verification): a pool of 96 slots that gives
~32 parallel without spec gives only ~19 parallel with MTP. Size the pool accordingly if
you need high concurrency.
- 512 K context is essentially free. NoPE (no positional embeddings; Mamba carries order) +
80 of 88 layers being Mamba means KV barely grows: 262 K / 512 K / 1 M all run at ~the same
throughput. No RoPE/YaRN scaling needed; just raise context-length and lift the config cap
via json-model-override-args.
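
The Mamba-pool clamp is plain integer division, so it is easy to size the pool ahead of time. In the sketch below, the per-request slot counts (3 without spec, 5 with MTP) are inferred from the ~32 and ~19 figures above and are an assumption, not values read from SGLang; the helper names are hypothetical.

```python
def max_running_requests(mamba_pool_slots: int, slots_per_request: int) -> int:
    """Concurrency clamp: how many requests fit in the Mamba state pool."""
    return mamba_pool_slots // slots_per_request


def pool_slots_for(target_concurrency: int, slots_per_request: int) -> int:
    """Smallest pool that admits the requested number of parallel requests."""
    return target_concurrency * slots_per_request


# Assumed per-request reservations, back-derived from the observed limits.
SLOTS_NO_SPEC = 3
SLOTS_WITH_MTP = 5

print(max_running_requests(96, SLOTS_NO_SPEC))   # 32
print(max_running_requests(96, SLOTS_WITH_MTP))  # 19
print(pool_slots_for(32, SLOTS_WITH_MTP))        # 160
```

Under these assumed slot counts, keeping 32-way concurrency with MTP enabled would need a pool of about 160 slots instead of 96.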
6. EP=4 vs EP=1 (bonus)
We also reran the whole matrix with expert-parallel MoE (ep-size=4, all-to-all dispatch)
instead of tensor-parallel (ep-size=1), with MTP 3/4 held identical:
| Metric (MTP 3/4) | EP=1 | EP=4 | Winner |
|---|---|---|---|
| n=1 tok/s | 53.99 | 58.85 | EP=4 (+9 %) |
| n=4 peak | 136.8 | 141.6 | EP=4 (+3.5 %) |
| n=8 peak | 199.7 | ~200 (524 K: 201.5, clean 8/8) | tie |
| accept_len | 2.67 | 2.74 | tie |
EP=4’s win is at low concurrency (single-stream +9 %): at n=1 the TP-MoE all-reduce latency
per layer dominates, and the expert-parallel all-to-all on small token counts is cheaper. By n=8
the all-to-all overhead catches up and it’s a wash. accept_len / NaN / the §5 crashes are all
identical between EP=1 and EP=4 — so MTP behaviour is unchanged; the delta is pure MoE
parallelism. If single-user latency matters, EP=4; if you only care about aggregate throughput
under load, either is fine.
7. Which SGLang PRs make MTP work here
For anyone trying to reproduce on their own build, these are the NemotronH-MTP changes this image
carries (all June 2026):
- #24955: Support Nemotron DP attention and MTP
- #28102: Fix DP attention + EP mode of Nemotron
- #27184: Fix Nemotron Super MTP deploy (spec-v2 / B200)
- #27998: NemotronH MTP with radix cache (removes the --disable-radix-cache requirement;
GB10-validated)
- the #21138 weight-loader fix (stop filtering lm_head.weight / backbone.embeddings out
of the MTP draft path); this is what lifts accept_len off 1.00
Upstream scitrera/dgx-spark-sglang:0.5.12 does not carry these — it boots the model but MTP
does not pay off. You need a 0.5.13-dev / current-main build.
8. Full sources
- Repo (Ansible + test logs + the SGLang SM121 build recipe): github.com/vroomfondel/dgxarley
- This run’s test logs (full matrices, per-case numbers, crash signatures):
  - Super-120B EP=1 (the headline run):
    TESTLOGS/sglang_nn4_tp4_ep1/nemotron-3-super-120b-a12b-nvfp4/TESTLOG_nv580.159_sglang-0.5.13-mtp_..._4n.md
  - Super-120B EP=4:
    TESTLOGS/sglang_nn4_tp4_ep4/nemotron-3-super-120b-a12b-nvfp4/TESTLOG_nv580.159_sglang-0.5.13-mtp_..._4n.md
  - Model profile (the exact serving config):
    roles/k8s_dgx/model_profiles/nvidia-nvidia-nemotron-3-super-120b-a12b-nvfp4.yml
- SGLang SM121 build recipe (how the image is built):
  scripts/build_sm121_image.sh
- Sibling post: Nemotron-3-Ultra-550B-A55B-NVFP4 on 4× DGX Spark
- Related prior art (MTP previously broken on Spark): vLLM crash thread t/366660,
  SGLang accept-rate bug sglang#21138