Nemotron-3-Super-120B-A12B-NVFP4 + MTP on 4× DGX Spark via SGLang (TP=4, RoCE) - MTP actually pays off: 1.70× single-stream, accept-len ≈ 2.7

ht12 · June 17, 2026, 1:08pm

TL;DR — Multi-Token Prediction (MTP / EAGLE) on the NemotronH-NVFP4 Super finally
works on DGX Spark (SM121), with a real speedup, on a current SGLang dev build.
This is the gap the existing threads leave open: the published state for Super-120B + MTP
on Spark is “crashes” (vLLM, t/366660)
or “0% draft acceptance, accept_len = 1.00” (SGLang, sglang#21138).
On a build carrying the June-2026 NemotronH-MTP fixes we get accept_len ≈ 2.7,
1.70× single-stream and 1.37× at 8-way concurrency over the no-spec baseline — and,
notably, the 3-step / 4-draft depth beats NVIDIA’s own cookbook 5/5 recipe.

This is the sibling of the earlier Nemotron-3-Ultra-550B post: same 4× DGX Spark cluster,
same RoCE setup, this time the 120B Super with MTP on.

1. Hardware / software

Component	Value
Nodes	4× DGX Spark (ASUS Ascent GX10), GB10 / SM121 Blackwell, 128 GB each, 1 GPU/node
Topology	1 head + 3 workers, orchestrated on K3s; control-plane on a separate x86 box (no GPU)
Driver	580.159.03
CUDA	13.0 (host toolkit 13.0.3)
Kernel / OS	6.17.0-1021-nvidia, Ubuntu 24.04.4 LTS (aarch64)
Interconnect	QSFP RoCE over ConnectX-7 SR-IOV VFs, MTU 9000 (NCCL transport = RoCE)
NCCL	2.30.4
SGLang	0.5.13-dev (image: `xomoxcc/dgx-spark-sglang:0.5.13-dev-nemotronh-mtp-sm121`)
Model	`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` (`NemotronHForCausalLM`, Mamba2 + MoE + attn hybrid; ~67 GB NVFP4 weights, mixed precision)

Image caveat (important for reproduction): MTP on the NemotronH-NVFP4 path needs a build
that carries the June-2026 fixes. The mainline upstream scitrera/dgx-spark-sglang:0.5.12
used for the Ultra post does not — it boots the model fine but MTP either no-ops
(accept_len ≈ 1) or isn’t wired up at all. See §7 for the exact PRs this build includes.

2. The recipe (winner)

--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--trust-remote-code
--quantization modelopt_fp4
--tp-size 4 --pp-size 1            # 4 nodes, single TP group, no pipeline stages
--nnodes 4
--context-length 524288           # plus json-model-override-args: max_position_embeddings=524288
--mem-fraction-static 0.80
--attention-backend flashinfer    # triton attn is hard-asserted off on NemotronH
--moe-runner-backend flashinfer_cutlass
--fp4-gemm-backend flashinfer_cutlass
--kv-cache-dtype fp8_e4m3
--reasoning-parser nemotron_3 --tool-call-parser qwen3_coder
--disable-piecewise-cuda-graph    # full CUDA-graph, piecewise off (per the model card)
# ---- MTP / speculative decoding (built-in MTP layer, EAGLE-style) ----
--speculative-algorithm EAGLE
--speculative-num-steps 3         # <-- WINNER. NOT the cookbook's 5.
--speculative-num-draft-tokens 4  # <-- WINNER. NOT the cookbook's 5.
--speculative-eagle-topk 1

No --disable-radix-cache is needed on this build (it was a requirement on older nightlies;
sglang#27998 removed it).
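
For scripted launches, the flag list above can be kept as data and rendered into a command line. The following is only a minimal sketch: it assumes the stock `python3 -m sglang.launch_server` entry point and the `--json-model-override-args` spelling of the override, and the helper name `build_server_args` is hypothetical. Per-node rendezvous flags are deliberately left out because the post does not list them.

```python
import json
import shlex

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
CONTEXT_LEN = 524288


def build_server_args(num_steps=3, num_draft_tokens=4):
    """Return the section 2 flags as an argv list (hypothetical helper)."""
    return [
        "--model-path", MODEL,
        "--trust-remote-code",
        "--quantization", "modelopt_fp4",
        "--tp-size", "4", "--pp-size", "1",
        "--nnodes", "4",
        "--context-length", str(CONTEXT_LEN),
        # Lift the config cap to match the requested context length.
        "--json-model-override-args",
        json.dumps({"max_position_embeddings": CONTEXT_LEN}),
        "--mem-fraction-static", "0.80",
        "--attention-backend", "flashinfer",
        "--moe-runner-backend", "flashinfer_cutlass",
        "--fp4-gemm-backend", "flashinfer_cutlass",
        "--kv-cache-dtype", "fp8_e4m3",
        "--reasoning-parser", "nemotron_3",
        "--tool-call-parser", "qwen3_coder",
        "--disable-piecewise-cuda-graph",
        # MTP / speculative decoding: the 3/4 depth is the default here.
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(num_steps),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
        "--speculative-eagle-topk", "1",
    ]


if __name__ == "__main__":
    # Assumed entry point; node rank and rendezvous flags are not included.
    cmd = ["python3", "-m", "sglang.launch_server", *build_server_args()]
    print(shlex.join(cmd))
```

Calling `build_server_args(5, 5)` renders the cookbook depth instead, which makes A/B runs of the draft depth a one-argument change.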

3. Results

Methodology: each request generates up to 3072 tokens; n=N = N concurrent requests.
“peak” = Σ of per-request tok/s over the successful requests (i.e. the steady-state
aggregate decode rate), not total_tokens / wall_time. ok = successful / failed
(a “failed” request here is one the harness’s repetition detector flagged, not a server error).

TP=4, EP=1 (tensor-parallel MoE) — the headline run

Config	n=1 tok/s	n=4 peak	n=8 peak	n=8 ok	accept_len	NaN?
no-spec (baseline)	31.67	95.0	146.3	8/0	—	—
MTP 3 / 4 (winner)	53.99	136.8	199.7	8/0	≈ 2.7	no
MTP 5 / 5 (NVIDIA cookbook)	51.5	124.0	152.7	7/1	≈ 3.0	no
MTP 5 / 7	53.21	120.4	175.1	8/0	≈ 2.9	no

Speedup of the 3/4 winner vs no-spec: 1.70× single-stream, 1.37× at n=8. Clean 8/8,
coherent output, zero NaN.
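
The speedup figures are plain ratios against the no-spec row. A small sketch that recomputes them from the table values (the dictionary layout and function name are illustrative, not part of the test harness):

```python
# tok/s values copied from the TP=4, EP=1 table above.
BASELINE = {"n1": 31.67, "n4": 95.0, "n8": 146.3}
RUNS = {
    "MTP 3/4": {"n1": 53.99, "n4": 136.8, "n8": 199.7},
    "MTP 5/5": {"n1": 51.5, "n4": 124.0, "n8": 152.7},
    "MTP 5/7": {"n1": 53.21, "n4": 120.4, "n8": 175.1},
}


def speedups(run):
    """Ratio of each concurrency level's throughput to the no-spec baseline."""
    return {k: run[k] / BASELINE[k] for k in BASELINE}


for name, run in RUNS.items():
    s = speedups(run)
    print(f"{name}: n=1 {s['n1']:.2f}x, n=4 {s['n4']:.2f}x, n=8 {s['n8']:.2f}x")
```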

Why 3/4 and not the cookbook 5/5

This is the surprising part. NVIDIA’s Advanced Deployment Guide recommends
EAGLE steps=5, draft=5. On this model + image, 5/5 loses to 3/4:

- 5/5 has the higher accept_len (≈ 3.0 vs ≈ 2.7) but lower net throughput: the extra
  draft compute per step costs more than the extra accepted tokens save.
- 5/5 also tripped the repetition detector on one of the 8 concurrent requests (7/8),
  collapsing its n=8 peak to 152.7, barely above the no-spec 146.3.
- 5/7 (the TRT-LLM “accept-3.45” depth) boots clean but is also slower than 3/4 at n=8
  (175.1 vs 199.7).

So on DGX Spark / SM121, shorter is better: steps=3 / draft=4 is the throughput optimum.
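
One way to see the trade-off is to back out the time per draft+verify cycle from the single-stream numbers. This is a rough sketch, assuming tok/s ≈ accept_len × cycles per second (so the no-spec run has accept_len = 1) and using the rounded accept_len values from the table; the function name is illustrative.

```python
def cycle_ms(tok_per_s, accept_len):
    """Approximate wall time of one draft+verify cycle, in milliseconds."""
    return 1000.0 * accept_len / tok_per_s


# (n=1 tok/s, accept_len) per config; accept_len values are approximate.
RUNS = {
    "no-spec": (31.67, 1.0),
    "MTP 3/4": (53.99, 2.7),
    "MTP 5/5": (51.5, 3.0),
    "MTP 5/7": (53.21, 2.9),
}

base_ms = cycle_ms(*RUNS["no-spec"])
for name, (tps, acc) in RUNS.items():
    ms = cycle_ms(tps, acc)
    print(f"{name}: {ms:.1f} ms per cycle ({ms - base_ms:+.1f} ms vs no-spec)")
```

Under these assumptions a 5/5 cycle takes roughly 16 % longer than a 3/4 cycle while yielding only about 11 % more tokens, which matches the lower net throughput. Because accept_len is only reported to one decimal, treat the millisecond figures as estimates.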

4. MTP findings (the point of this post)

- MTP works on NemotronH-NVFP4 / SM121. accept_len is ≈ 2.7, not the
  sglang#21138 “0.33 / accept_len = 1.00” loader-bug signature. The MTP weight-loader fix
  (the loader had been filtering lm_head.weight + backbone.embeddings out of the draft path)
  is in this build.
- No NaN logits. The NVFP4 MTP target-logits NaN that was chased on release/v0.5.13
  (sglang#27828) does not appear here; we grepped every decode log across all MTP runs.
- 3/4 > cookbook 5/5 (see §3): re-tune the draft depth down on this hardware.
- It’s reproducible at scale: 8/8 clean at ctx 524 K with MTP buffers + KV co-resident.

5. Startup traps (same spirit as the Ultra post)

These are the non-obvious things that cause a boot-time crash or silent perf loss:

- Attention backend must be flashinfer. SGLang hard-asserts triton attention off for
  NemotronH (“the first layer might not be an attention layer”, because the hybrid pattern
  starts with Mamba). flashinfer works (head_dim 128).
- MoE runner must be flashinfer_cutlass. triton crashes at startup with
  AssertionError: mismatch in expected n in cutlass_moe_fp4. The triton flag is ignored on
  the NVFP4 modelopt path, which always dispatches through cutlass_moe_fp4, and the
  LatentMoE / 512-expert shape trips the assert.
- fp4 GEMM: stay on flashinfer_cutlass. flashinfer_cudnn crashes with
  RuntimeError: cuDNN is not available, because the nvidia-cudnn-cu12 wheel isn’t in the image.
- Concurrency is gated by the Mamba state pool, not KV. max_running_requests is clamped to
  max_mamba_cache_size // per-request-slots. With MTP this per-request reservation grows
  (extra “intermediate” SSM/conv caches for draft verification): a pool of 96 slots that gives
  ~32 parallel without spec gives only ~19 parallel with MTP. Size the pool accordingly if
  you need high concurrency.
- 512 K context is essentially free. NoPE (no positional embeddings; Mamba carries order) +
  80 of 88 layers being Mamba means KV barely grows: 262 K / 512 K / 1 M all run at ~the same
  throughput. No RoPE/YaRN scaling needed; just raise context-length and lift the config cap
  via json-model-override-args.
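
The Mamba-pool clamp is easy to sanity-check offline. A minimal sketch of the integer division described above; the per-request slot counts (3 without spec, 5 with MTP) are assumptions back-solved from the "96 slots → ~32 / ~19" figures, not values read from SGLang, and both function names are hypothetical.

```python
SLOTS_NO_SPEC = 3  # assumption: 96 // 3 = 32 matches the "~32 parallel" figure
SLOTS_MTP = 5      # assumption: 96 // 5 = 19 matches the "~19 parallel" figure


def max_running_requests(mamba_cache_size, slots_per_request):
    """Concurrency ceiling imposed by the Mamba state pool."""
    return mamba_cache_size // slots_per_request


def pool_size_for(target_concurrency, slots_per_request):
    """Smallest pool that still allows the target number of parallel requests."""
    return target_concurrency * slots_per_request


print(max_running_requests(96, SLOTS_NO_SPEC))  # cluster pool, no spec
print(max_running_requests(96, SLOTS_MTP))      # cluster pool, MTP on
print(pool_size_for(32, SLOTS_MTP))             # pool needed for 32 parallel with MTP
```

If the real per-request reservation differs on your build, read it from the server startup log and substitute it; the arithmetic stays the same.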

6. EP=4 vs EP=1 (bonus)

We also ran the whole matrix again with expert-parallel MoE (ep-size=4, all-to-all dispatch)
instead of tensor-parallel (ep-size=1), with the MTP 3/4 settings held identical:

Metric (MTP 3/4)	EP=1	EP=4	Winner
n=1 tok/s	53.99	58.85	EP=4 (+9 %)
n=4 peak	136.8	141.6	EP=4 (+3.5 %)
n=8 peak	199.7	~200 (524 K: 201.5, clean 8/8)	tie
accept_len	2.67	2.74	tie

EP=4’s win is at low concurrency (single-stream +9 %): at n=1 the TP-MoE all-reduce latency
per layer dominates, and the expert-parallel all-to-all on small token counts is cheaper. By n=8
the all-to-all overhead catches up and it’s a wash. accept_len / NaN / the §5 crashes are all
identical between EP=1 and EP=4 — so MTP behaviour is unchanged; the delta is pure MoE
parallelism. If single-user latency matters, EP=4; if you only care about aggregate throughput
under load, either is fine.

7. Which SGLang PRs make MTP work here

For anyone trying to reproduce on their own build, these are the NemotronH-MTP changes this image
carries (all June 2026):

- #24955: Support Nemotron DP attention and MTP
- #28102: Fix DP attention + EP mode of Nemotron
- #27184: Fix Nemotron Super MTP deploy (spec-v2 / B200)
- #27998: NemotronH MTP with radix cache (removes the --disable-radix-cache requirement;
  GB10-validated)
- the #21138 weight-loader fix (stop filtering lm_head.weight / backbone.embeddings out
  of the MTP draft path); this is what lifts accept_len off 1.00

Upstream scitrera/dgx-spark-sglang:0.5.12 does not carry these — it boots the model but MTP
does not pay off. You need a 0.5.13-dev / current-main build.

8. Full sources

Repo (Ansible + test logs + the SGLang SM121 build recipe): github.com/vroomfondel/dgxarley
This run’s test logs (full matrices, per-case numbers, crash signatures):
- Super-120B EP=1 (the headline run): TESTLOGS/sglang_nn4_tp4_ep1/nemotron-3-super-120b-a12b-nvfp4/TESTLOG_nv580.159_sglang-0.5.13-mtp_..._4n.md
- Super-120B EP=4: TESTLOGS/sglang_nn4_tp4_ep4/nemotron-3-super-120b-a12b-nvfp4/TESTLOG_nv580.159_sglang-0.5.13-mtp_..._4n.md
- Model profile (the exact serving config): roles/k8s_dgx/model_profiles/nvidia-nvidia-nemotron-3-super-120b-a12b-nvfp4.yml
SGLang SM121 build recipe (how the image is built): scripts/build_sm121_image.sh
Sibling post: Nemotron-3-Ultra-550B-A55B-NVFP4 on 4× DGX Spark
Related prior art (MTP previously broken on Spark): vLLM crash thread t/366660,
SGLang accept-rate bug sglang#21138

truetotosse · June 17, 2026, 1:24pm

What is the point of having nemotron super on 4x sparks?

ht12 · June 17, 2026, 2:52pm

Up front, a fair concession. If you want raw aggregate throughput, the config
in this post (tp=4, ep=4) isn’t how you’d get it — splitting one model across four
nodes pays an all-reduce tax every layer. The axis that actually scales is
--dp-size 4 without --enable-dp-attention: four full replicas behind SGLang’s
router, zero cross-node comms, ~4× the single-node rate, and the model fits on one
128 GB Spark so replication works. We haven’t benched that path yet (--dp-size + MTP
is untested for us), so it’s the honest “what we’d reach for next,” not a measured claim.

So why TP=4 at all?

Because the point of the post isn’t the topology — it’s that MTP works, which it
didn’t before on any node count. The published state was:

- “crashes” (vLLM, t/366660)
- “0 % acceptance, accept_len = 1.00” (SGLang, sglang#21138)

…single-node attempts included. On a build with the June-2026 NemotronH-MTP fixes it
pays off: accept_len ≈ 2.7, 1.70× single-stream, and 3/4 beating NVIDIA’s own
cookbook 5/5. That’s the result worth sharing. TP=4 was just the bench the cluster
already serves one model on, and the recipe holds regardless of how you parallelize.

“Why not one Spark, then?” — we actually tried it

This is the interesting part. The weights are only ~67 GB, so they fit on a single
128 GB box. But on this hybrid the concurrency limit isn’t KV — it’s the Mamba state
pool:

max_running_requests = max_mamba_cache_size // per-request-slots

On one Spark the full model loads unsharded to ~72.7 GB, leaving just ~37.5 GB
free. The Mamba pool we run on the cluster (96 slots) wants ~27 GB on top of KV +
CUDA graphs — which doesn’t fit:

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

We did get it running single-node, but only after two cuts (and the engine’s own
“increase --mem-fraction-static” advice is a red herring — avail mem after weights
is physically pinned at ~37.5 GB regardless of the fraction):

- shrink the Mamba pool (max_mamba_cache_size 96 → 24) to clear the pool-init OOM, then
- lower mem-fraction-static (≈0.78) so the KV pool stops eating the headroom the
  MTP draft head (~7 GB) needs.

Raising mem-fraction-static is the wrong reflex in both phases.

Measured single-node memory budget (one Spark, ~121 GB usable, MTP on):

Item	GB
Base weights (unsharded)	72.7
Mamba pool (24 slots)	7.1
KV cache (885k tokens, fp8)	3.4
MTP draft head	7.0
Draft KV + CUDA graphs	~1.5
free at the end	~16

The result boots and serves — MTP works on one Spark too (accept_len ≈ 2.4) — but the
concurrency ceiling is max_running_requests = 4, gated by the Mamba state pool, vs
the cluster’s ~32 (≈19 with MTP). Under TP=4 that pool is sharded across nodes (~17 GB
weights/node), so it fits at full size.

And there’s a second ceiling the KV cut introduces: with the pool trimmed to ~885k
tokens, those 4 slots share a single KV budget — ~221k tokens each if all four run at
once. So a single Spark can serve either 4 concurrent at moderate context, or fewer at
the full 512 K — not four long-context requests at the same time. Sharding across nodes
restores both ceilings: more Mamba slots (parallelism) and more aggregate KV (long
context × concurrency together) — which is the whole reason the model’s 512 K context is
worth anything in a multi-user setting.
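
The shared-KV ceiling can be worked through with the numbers from the budget table. A small sketch, treating "885k tokens" and "3.4 GB" as the approximate figures they are (decimal GB is an assumption); the constant names are illustrative.

```python
KV_POOL_TOKENS = 885_000  # "~885k tokens" from the single-node budget table
KV_POOL_GB = 3.4          # fp8 KV cache size from the same table
MAX_SLOTS = 4             # max_running_requests on one Spark with MTP
FULL_CONTEXT = 524_288    # --context-length

# Tokens each request gets if all four slots are busy at once.
per_slot_tokens = KV_POOL_TOKENS // MAX_SLOTS

# How many full-length requests the pool can hold at the same time.
full_context_requests = KV_POOL_TOKENS // FULL_CONTEXT

# Implied KV footprint per token (assumes decimal GB).
bytes_per_token = KV_POOL_GB * 1e9 / KV_POOL_TOKENS

print(per_slot_tokens, full_context_requests, round(bytes_per_token))
```

So the trimmed pool holds one full 512 K request, or four requests at about 221k tokens each, and the implied per-token cost lands close to the ~4 KB/token quoted for the hybrid's KV cache.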

The honest shape of it

Four Sparks aren’t required — they buy more usable concurrency × context by
sharding the Mamba and KV pools, not single-stream magic.

If we’re being fully honest, two Sparks would probably be the sweet spot:

- TP=2 halves the all-reduce of TP=4
- restores a healthy Mamba pool (~36 GB weights/node, tons of headroom)
- KV pool no longer has to be starved to fit the MTP head
- model still fits with room to spare

We just happen to have four, so four is what got benched.

The KV-cache angle is backwards here

The reason people usually cite for going multi-node — distributing a growing KV cache —
doesn’t apply. NemotronH is a hybrid: only 8 of 88 layers are attention (the other 80
are Mamba, carrying a fixed recurrent state), so KV barely grows (~4 KB/token) and 512 K
context is nearly free per request. The thing actually worth distributing is the Mamba
state pool — and, as the single-node run shows, the aggregate KV budget once you want
long context across several requests at once.

So, to actually answer the question honestly: if you only ever do single requests, one
Spark is completely sufficient. It boots, it serves, MTP works, and a lone request gets
the full 512 K context — the second node earns nothing for you. Four Sparks only start to
pay off the moment you want many requests in flight and long context at the same time:
that’s when the sharded Mamba + KV pools turn into real, usable headroom.

We have four, so the cluster just runs it — clean 8/8, MTP on, no babysitting. But
that was never the headline. The headline is that MTP works; the node count is just
where you land on the concurrency-vs-cost curve. One Spark for yourself, two for a
comfortable sweet spot, four if you’re serving a crowd.

truetotosse · June 17, 2026, 3:05pm

I understand you use claude for the replies, but you could put some effort into it at least.

ht12 · June 17, 2026, 3:13pm

Actually, I did. Sorry to not have met your expectations.

nrevo · June 18, 2026, 4:24am

I am using this model + MTP on 2 DGX Sparks with vLLM. I’ll attach my recipe, but will be trying some of the suggestions from here as well. I am launching with the sparkrun project and a locally built vllm-node image from the spark-vllm-docker project. When I return from camping next week I’ll figure out how to host the recipe in a proper repo.

run.sh.txt (795 Bytes)

recipe.yaml.txt (1.6 KB)

Edit: I forgot to mention I modified the sparkrun-vllm-docker/Dockerfile with

ARG CUDA_IMAGE=nvidia/cuda:13.3.0-devel-ubuntu24.04
