Nemotron-3-Super-120B-A12B-NVFP4 + MTP on 4× DGX Spark via SGLang (TP=4, RoCE) - MTP actually pays off: 1.70× single-stream, accept-len ≈ 2.7

TL;DR — Multi-Token Prediction (MTP / EAGLE) on the NemotronH-NVFP4 Super finally
works on DGX Spark (SM121), with a real speedup, on a current SGLang dev build.
This is the gap the existing threads leave open: the published state for Super-120B + MTP
on Spark is “crashes” (vLLM, t/366660)
or “0% draft acceptance, accept_len = 1.00” (SGLang, sglang#21138).
On a build carrying the June-2026 NemotronH-MTP fixes we get accept_len ≈ 2.7,
1.70× single-stream and 1.37× at 8-way concurrency over the no-spec baseline — and,
notably, the 3-step / 4-draft depth beats NVIDIA’s own cookbook 5/5 recipe.

This is the sibling of the earlier Nemotron-3-Ultra-550B post: same 4× DGX Spark cluster,
same RoCE setup, this time the 120B Super with MTP on.


1. Hardware / software

| Component | Value |
| --- | --- |
| Nodes | 4× DGX Spark (ASUS Ascent GX10), GB10 / SM121 Blackwell, 128 GB each, 1 GPU/node |
| Topology | 1 head + 3 workers, orchestrated on K3s; control-plane on a separate x86 box (no GPU) |
| Driver | 580.159.03 |
| CUDA | 13.0 (host toolkit 13.0.3) |
| Kernel / OS | 6.17.0-1021-nvidia, Ubuntu 24.04.4 LTS (aarch64) |
| Interconnect | QSFP RoCE over ConnectX-7 SR-IOV VFs, MTU 9000 (NCCL transport = RoCE) |
| NCCL | 2.30.4 |
| SGLang | 0.5.13-dev (image: xomoxcc/dgx-spark-sglang:0.5.13-dev-nemotronh-mtp-sm121) |
| Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronHForCausalLM, Mamba2 + MoE + attn hybrid; ~67 GB NVFP4 weights, mixed precision) |

Image caveat (important for reproduction): MTP on the NemotronH-NVFP4 path needs a build
that carries the June-2026 fixes. The mainline upstream scitrera/dgx-spark-sglang:0.5.12
used for the Ultra post does not — it boots the model fine but MTP either no-ops
(accept_len ≈ 1) or isn’t wired. See §7 for the exact PRs this build includes.


2. The recipe (winner)

--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--trust-remote-code
--quantization modelopt_fp4
--tp-size 4 --pp-size 1            # 4 nodes, single TP group, no pipeline stages
--nnodes 4
--context-length 524288           # + --json-model-override-args max_position_embeddings=524288
--mem-fraction-static 0.80
--attention-backend flashinfer    # triton attn is hard-asserted off on NemotronH
--moe-runner-backend flashinfer_cutlass
--fp4-gemm-backend flashinfer_cutlass
--kv-cache-dtype fp8_e4m3
--reasoning-parser nemotron_3 --tool-call-parser qwen3_coder
--disable-piecewise-cuda-graph    # full CUDA-graph, piecewise off (per the model card)
# ---- MTP / speculative decoding (built-in MTP layer, EAGLE-style) ----
--speculative-algorithm EAGLE
--speculative-num-steps 3         # <-- WINNER. NOT the cookbook's 5.
--speculative-num-draft-tokens 4  # <-- WINNER. NOT the cookbook's 5.
--speculative-eagle-topk 1

No --disable-radix-cache is needed on this build (it was a requirement on older nightlies;
sglang#27998 removed it).
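
For anyone scripting the launch, here is a minimal sketch that assembles the per-node command line from the flags above. It only builds and prints the argv, it does not start a server. The --node-rank and --dist-init-addr values are assumptions added for multi-node startup (they are not part of the posted recipe), and head-node:5000 is a hypothetical address.

```python
import json
import shlex


def build_launch_args(node_rank: int, dist_init_addr: str) -> list[str]:
    """Assemble the per-node argv for the recipe above (does not execute it)."""
    # Lifts the config cap so --context-length 524288 is accepted.
    override = json.dumps({"max_position_embeddings": 524288})
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
        "--trust-remote-code",
        "--quantization", "modelopt_fp4",
        "--tp-size", "4", "--pp-size", "1",
        "--nnodes", "4",
        # ASSUMPTION: multi-node plumbing, not shown in the posted recipe.
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
        "--context-length", "524288",
        "--json-model-override-args", override,
        "--mem-fraction-static", "0.80",
        "--attention-backend", "flashinfer",
        "--moe-runner-backend", "flashinfer_cutlass",
        "--fp4-gemm-backend", "flashinfer_cutlass",
        "--kv-cache-dtype", "fp8_e4m3",
        "--reasoning-parser", "nemotron_3",
        "--tool-call-parser", "qwen3_coder",
        "--disable-piecewise-cuda-graph",
        # MTP / speculative decoding: the 3-step / 4-draft winner.
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-num-draft-tokens", "4",
        "--speculative-eagle-topk", "1",
    ]


if __name__ == "__main__":
    for rank in range(4):
        # "head-node:5000" is a placeholder; use your own head address.
        print(shlex.join(build_launch_args(rank, "head-node:5000")))
```

Keeping the flags in one function makes it easy to diff a 3/4 run against a 5/5 run by changing only the two speculative values.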


3. Results

Methodology: each request generates up to 3072 tokens; n=N means N concurrent requests.
“peak” = Σ of per-request tok/s over the successful requests (i.e. the steady-state
aggregate decode rate), not total_tokens / wall_time. ok = successful / failed
(a “failed” request here is one the harness’s repetition detector flagged, not a server error).
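
To make the difference between the two metrics concrete, here is a small sketch of both calculations. The RequestResult type and the sample numbers are made up for illustration; they are not the benchmark harness or its data.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    tokens: int      # completion tokens generated by this request
    seconds: float   # decode time of this request
    ok: bool         # False if the repetition detector flagged it


def peak_tok_s(results):
    """Sum of per-request tok/s over the successful requests (the "peak" metric)."""
    return sum(r.tokens / r.seconds for r in results if r.ok)


def wall_clock_tok_s(results, wall_seconds):
    """Successful tokens divided by total wall time, including ramp-up and tail."""
    return sum(r.tokens for r in results if r.ok) / wall_seconds


# Hypothetical run: two clean requests, one flagged by the detector.
results = [
    RequestResult(tokens=3072, seconds=120.0, ok=True),
    RequestResult(tokens=3072, seconds=128.0, ok=True),
    RequestResult(tokens=600, seconds=30.0, ok=False),
]

print(f"peak       : {peak_tok_s(results):.1f} tok/s")
print(f"wall-clock : {wall_clock_tok_s(results, wall_seconds=140.0):.1f} tok/s")
```

The peak figure ignores ramp-up and the tail where only a few requests are still decoding, so it will normally read higher than the wall-clock figure for the same run.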

TP=4, EP=1 (tensor-parallel MoE) — the headline run

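| Config | n=1 tok/s | n=4 peak | n=8 peak | n=8 ok/failed | accept_len | NaN? |
| --- | --- | --- | --- | --- | --- | --- |
| no-spec (baseline) | 31.67 | 95.0 | 146.3 | 8/0 | - | - |
| MTP 3 / 4 (winner) | 53.99 | 136.8 | 199.7 | 8/0 | ≈ 2.7 | no |
| MTP 5 / 5 (NVIDIA cookbook) | 51.5 | 124.0 | 152.7 | 7/1 | ≈ 3.0 | no |
| MTP 5 / 7 | 53.21 | 120.4 | 175.1 | 8/0 | ≈ 2.9 | no |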

Speedup of the 3/4 winner vs no-spec: 1.70× single-stream, 1.37× at n=8. Clean 8/8,
coherent output, zero NaN.
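
The speedup figures are plain ratios against the no-spec row. A short sketch that recomputes them from the table values (the dictionary layout is just for illustration):

```python
# Throughput figures copied from the TP=4, EP=1 table above (tok/s).
runs = {
    "no-spec": {"n1": 31.67, "n4": 95.0, "n8": 146.3},
    "MTP 3/4": {"n1": 53.99, "n4": 136.8, "n8": 199.7},
    "MTP 5/5": {"n1": 51.5, "n4": 124.0, "n8": 152.7},
    "MTP 5/7": {"n1": 53.21, "n4": 120.4, "n8": 175.1},
}


def speedup(config: str, column: str) -> float:
    """Ratio of a config's throughput to the no-spec baseline in the same column."""
    return runs[config][column] / runs["no-spec"][column]


for name in ("MTP 3/4", "MTP 5/5", "MTP 5/7"):
    ratios = ", ".join(f"{col}={speedup(name, col):.2f}x" for col in ("n1", "n4", "n8"))
    print(f"{name}: {ratios}")
```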

Why 3/4 and not the cookbook 5/5

This is the surprising part. NVIDIA’s Advanced Deployment Guide recommends EAGLE steps=5,
draft=5. On this model + image, 5/5 loses to 3/4:

  • 5/5 has the higher accept_len (≈ 3.0 vs ≈ 2.7) but lower net throughput — the extra
    draft compute per step costs more than the extra accepted tokens save.
  • 5/5 also tripped the repetition detector on one of the 8 concurrent requests (7/8),
    collapsing its n=8 peak to 152.7 — barely above the no-spec 146.3.
  • 5/7 (the TRT-LLM “accept-3.45” depth) boots clean but is also slower than 3/4 (175.1).

So on DGX Spark / SM121, shorter is better: steps=3 / draft=4 is the throughput optimum.
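
One rough way to see the trade-off in the numbers: if each draft-and-verify step emits accept_len tokens on average, then single-stream speedup ≈ accept_len / (step time relative to a plain decode step). Rearranging gives an implied relative step cost. This is a back-of-envelope model, not something SGLang reports, and the accept_len inputs are the approximate values from the table.

```python
BASELINE_N1 = 31.67  # no-spec single-stream tok/s from the results table

# (single-stream tok/s, approximate accept_len) from the results table.
mtp_runs = {
    "3/4": (53.99, 2.7),
    "5/5": (51.5, 3.0),
    "5/7": (53.21, 2.9),
}


def relative_step_cost(tok_s: float, accept_len: float, baseline: float = BASELINE_N1) -> float:
    """Implied cost of one spec step in units of one no-spec decode step.

    Assumes speedup = accept_len / relative_step_cost (simplified model).
    """
    speedup = tok_s / baseline
    return accept_len / speedup


step_costs = {name: relative_step_cost(*vals) for name, vals in mtp_runs.items()}
for name, cost in step_costs.items():
    print(f"steps/draft {name}: one spec step costs ~{cost:.2f}x a plain decode step")
```

Under this model the 5/5 step comes out noticeably more expensive than the 3/4 step, which is consistent with the higher accept_len not translating into higher throughput.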


4. MTP findings (the point of this post)

  1. MTP works on NemotronH-NVFP4 / SM121. accept_len is ≈ 2.7, not the sglang#21138
    “0.33 / accept_len = 1.00” loader-bug signature. The MTP weight-loader fix (it had been
    filtering lm_head.weight + backbone.embeddings out of the draft path) is in this build.
  2. No NaN logits. The NVFP4 MTP target-logits NaN that was chased on release/v0.5.13
    (sglang#27828) does not appear here; we grepped every decode log across all MTP runs.
  3. 3/4 > cookbook 5/5 (see §3) — re-tune the draft depth down on this hardware.
  4. It’s reproducible at scale: 8/8 clean at ctx 524 K with MTP buffers + KV co-resident.

5. Startup traps (same spirit as the Ultra post)

These are the non-obvious things that cause a boot-time crash or silent perf loss:

  • Attention backend must be flashinfer. SGLang hard-asserts triton attention off for
    NemotronH (“the first layer might not be an attention layer” — the hybrid pattern starts with
    Mamba). flashinfer works (head_dim 128).
  • MoE runner must be flashinfer_cutlass. triton startup-crashes with
    AssertionError: mismatch in expected n in cutlass_moe_fp4 — the triton flag is ignored on
    the NVFP4 modelopt path, which always dispatches through cutlass_moe_fp4, and the
    LatentMoE / 512-expert shape trips the assert.
  • fp4 GEMM: stay on flashinfer_cutlass. flashinfer_cudnn crashes with
    RuntimeError: cuDNN is not available — the nvidia-cudnn-cu12 wheel isn’t in the image.
  • Concurrency is gated by the Mamba state pool, not KV. max_running_requests is clamped to
    max_mamba_cache_size // per-request-slots. With MTP this per-request reservation grows
    (extra “intermediate” SSM/conv caches for draft verification): a pool of 96 slots that gives
    ~32 parallel without spec gives only ~19 parallel with MTP. Size the pool accordingly if
    you need high concurrency.
  • 512 K context is essentially free. NoPE (no positional embeddings; Mamba carries order) +
    80 of 88 layers being Mamba means KV barely grows — 262 K / 512 K / 1 M all run at ~the same
    throughput. No RoPE/YaRN scaling needed; just raise context-length and lift the config cap
    via json-model-override-args.
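
The Mamba-pool clamp from the list above is easy to sanity-check with integer division. In this sketch the per-request slot counts (3 without spec, 5 with MTP) are assumptions back-solved from the "96 slots gives ~32 / ~19" figures; they are not read from the SGLang source, so treat them as illustrative.

```python
def max_running_requests(max_mamba_cache_size: int, slots_per_request: int) -> int:
    """Concurrency ceiling imposed by the Mamba state pool (floor division)."""
    if slots_per_request <= 0:
        raise ValueError("slots_per_request must be positive")
    return max_mamba_cache_size // slots_per_request


def pool_size_for(target_concurrency: int, slots_per_request: int) -> int:
    """Smallest pool that still admits target_concurrency parallel requests."""
    return target_concurrency * slots_per_request


# ASSUMPTION: inferred from 96 -> ~32 (no spec) and 96 -> ~19 (MTP).
SLOTS_NO_SPEC = 3
SLOTS_WITH_MTP = 5

print(max_running_requests(96, SLOTS_NO_SPEC))   # no-spec ceiling for a 96-slot pool
print(max_running_requests(96, SLOTS_WITH_MTP))  # MTP ceiling for the same pool
print(pool_size_for(32, SLOTS_WITH_MTP))         # pool needed for 32 parallel with MTP
```

If those inferred slot counts hold, getting back to ~32 parallel requests with MTP on would need a pool of roughly 160 slots rather than 96.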

6. EP=4 vs EP=1 (bonus)

We also reran the whole matrix with expert-parallel MoE (ep-size=4, all-to-all dispatch)
instead of tensor-parallel (ep-size=1), holding MTP 3/4 fixed:

| Metric (MTP 3/4) | EP=1 | EP=4 | Winner |
| --- | --- | --- | --- |
| n=1 tok/s | 53.99 | 58.85 | EP=4 (+9 %) |
| n=4 peak | 136.8 | 141.6 | EP=4 (+3.5 %) |
| n=8 peak | 199.7 | ~200 (524 K: 201.5, clean 8/8) | tie |
| accept_len | 2.67 | 2.74 | tie |

EP=4’s win is at low concurrency (single-stream +9 %): at n=1 the TP-MoE all-reduce latency
per layer dominates, and the expert-parallel all-to-all on small token counts is cheaper. By n=8
the all-to-all overhead catches up and it’s a wash. accept_len / NaN / the §5 crashes are all
identical between EP=1 and EP=4 — so MTP behaviour is unchanged; the delta is pure MoE
parallelism. If single-user latency matters, EP=4; if you only care about aggregate throughput
under load, either is fine.


7. Which SGLang PRs make MTP work here

For anyone trying to reproduce on their own build, these are the NemotronH-MTP changes this image
carries (all June 2026):

  • #24955 — Support Nemotron DP attention and MTP
  • #28102 — Fix DP attention + EP mode of Nemotron
  • #27184 — Fix Nemotron Super MTP deploy (spec-v2 / B200)
  • #27998 — NemotronH MTP with radix cache (removes the --disable-radix-cache requirement;
    GB10-validated)
  • the #21138 weight-loader fix (stop filtering lm_head.weight / backbone.embeddings out
    of the MTP draft path) — this is what lifts accept_len off 1.00

Upstream scitrera/dgx-spark-sglang:0.5.12 does not carry these — it boots the model but MTP
does not pay off. You need a 0.5.13-dev / current-main build.


8. Full sources

What is the point of having nemotron super on 4x sparks?

Up front, a fair concession. If you want raw aggregate throughput, the config
in this post (tp=4, ep=4) isn’t how you’d get it — splitting one model across four
nodes pays an all-reduce tax every layer. The axis that actually scales is
--dp-size 4 without --enable-dp-attention: four full replicas behind SGLang’s
router, zero cross-node comms, ~4× the single-node rate, and the model fits on one
128 GB Spark so replication works. We haven’t benched that path yet (--dp-size + MTP
is untested for us), so it’s the honest “what we’d reach for next,” not a measured claim.

So why TP=4 at all?

Because the point of the post isn’t the topology — it’s that MTP works, which it
didn’t before on any node count. The published state was “crashes” (vLLM, t/366660) or
“0% draft acceptance, accept_len = 1.00” (SGLang, sglang#21138), single-node attempts
included. On a build with the June-2026 NemotronH-MTP fixes it pays off: accept_len ≈ 2.7,
1.70× single-stream, and 3/4 beating NVIDIA’s own cookbook 5/5. That’s the result worth
sharing. TP=4 was simply the setup the cluster already uses to serve one model, and the
recipe holds regardless of how you parallelize.

“Why not one Spark, then?” — we actually tried it

This is the interesting part. The weights are only ~67 GB, so they fit on a single
128 GB box. But on this hybrid the concurrency limit isn’t KV; it’s the Mamba state pool:

max_running_requests = max_mamba_cache_size // per-request-slots

On one Spark the full model loads unsharded to ~72.7 GB, leaving just ~37.5 GB
free. The Mamba pool we run on the cluster (96 slots) wants ~27 GB on top of KV +
CUDA graphs — which doesn’t fit:

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

We did get it running single-node, but only after two cuts (and the engine’s own
“increase --mem-fraction-static” advice is a red herring — avail mem after weights
is physically pinned at ~37.5 GB regardless of the fraction):

  1. shrink the Mamba pool (max_mamba_cache_size 96 → 24) to clear the pool-init OOM, then
  2. lower mem-fraction-static (≈0.78) so the KV pool stops eating the headroom the
    MTP draft head (~7 GB) needs — raising it is the wrong reflex in both phases.

Measured single-node memory budget (one Spark, ~121 GB usable, MTP on):

| Item | GB |
| --- | --- |
| Base weights (unsharded) | 72.7 |
| Mamba pool (24 slots) | 7.1 |
| KV cache (885k tokens, fp8) | 3.4 |
| MTP draft head | 7.0 |
| Draft KV + CUDA graphs | ~1.5 |
| free at the end | ~16 |

The result boots and serves — MTP works on one Spark too (accept_len ≈ 2.4) — but the
concurrency ceiling is max_running_requests = 4, gated by the Mamba state pool, vs
the cluster’s ~32 (≈19 with MTP). Under TP=4 that pool is sharded across nodes (~17 GB
weights/node), so it fits at full size.

And there’s a second ceiling the KV cut introduces: with the pool trimmed to ~885k
tokens, those 4 slots share a single KV budget — ~221k tokens each if all four run at
once. So a single Spark can serve either 4 concurrent at moderate context, or fewer at
the full 512 K, but not four long-context requests at the same time. Sharding across nodes
restores both ceilings: more Mamba slots (parallelism) and more aggregate KV (long
context × concurrency together) — which is the whole reason the model’s 512 K context is
worth anything in a multi-user setting.
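
The single-node KV ceiling is simple division; a sketch using the figures quoted above (885k-token KV pool, 4 running requests, 524288-token context). The function names are just for illustration.

```python
KV_POOL_TOKENS = 885_000      # single-node KV pool after the cuts
MAX_RUNNING_REQUESTS = 4      # single-node ceiling from the Mamba pool
FULL_CONTEXT = 524_288        # --context-length used in the recipe


def kv_tokens_per_request(pool_tokens: int, concurrent: int) -> int:
    """Even share of the KV pool when `concurrent` requests run at once."""
    return pool_tokens // concurrent


def full_context_requests(pool_tokens: int, context_len: int) -> int:
    """How many requests could each hold a full-length context in the pool."""
    return pool_tokens // context_len


print(kv_tokens_per_request(KV_POOL_TOKENS, MAX_RUNNING_REQUESTS))
print(full_context_requests(KV_POOL_TOKENS, FULL_CONTEXT))
```

With these numbers each of four concurrent requests gets about 221k tokens, and only one request at a time can actually fill the full 512 K window.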

The honest shape of it

Four Sparks aren’t required — they buy more usable concurrency × context by
sharding the Mamba and KV pools, not single-stream magic.

If we’re being fully honest, two Sparks would probably be the sweet spot:

  • TP=2 halves the all-reduce of TP=4
  • restores a healthy Mamba pool (~36 GB weights/node, tons of headroom)
  • KV pool no longer has to be starved to fit the MTP head
  • model still fits with room to spare

We just happen to have four, so four is what got benched.

The KV-cache angle is backwards here

The reason people usually cite for going multi-node — distributing a growing KV cache —
doesn’t apply. NemotronH is a hybrid: only 8 of 88 layers are attention (the other 80
are Mamba, carrying a fixed recurrent state), so KV barely grows (~4 KB/token) and 512 K
context is nearly free per request. The thing actually worth distributing is the Mamba
state pool and, as the single-node run shows, the aggregate KV budget once you want
long context across several requests at once.
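
The ~4 KB/token figure can be cross-checked against the measured single-node budget (3.4 GB of KV for 885k tokens). This sketch assumes decimal gigabytes; with binary GiB the result shifts by a few percent but stays near 4 KB.

```python
KV_POOL_GB = 3.4            # measured KV cache size, single node, fp8
KV_POOL_TOKENS = 885_000    # tokens that pool holds
FULL_CONTEXT = 524_288      # tokens in one full-length request


def kv_bytes_per_token(pool_gb: float, pool_tokens: int) -> float:
    """Average KV footprint per token, assuming 1 GB = 1e9 bytes."""
    return pool_gb * 1e9 / pool_tokens


per_token = kv_bytes_per_token(KV_POOL_GB, KV_POOL_TOKENS)
full_context_gb = per_token * FULL_CONTEXT / 1e9

print(f"KV per token        : {per_token / 1e3:.2f} KB")
print(f"KV for one 512 K req: {full_context_gb:.2f} GB")
```

So a single full-length request costs on the order of 2 GB of KV, which is why the per-request cost is small and the pressure only shows up once several long requests run together.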


So, to actually answer the question honestly: if you only ever do single requests, one
Spark is completely sufficient. It boots, it serves, MTP works, and a lone request gets
the full 512 K context; the second node earns nothing for you. Four Sparks only start to
pay off the moment you want many requests in flight and long context at the same time:
that’s when the sharded Mamba + KV pools turn into real, usable headroom.

We have four, so the cluster just runs it — clean 8/8, MTP on, no babysitting. But
that was never the headline. The headline is that MTP works; the node count is just
where you land on the concurrency-vs-cost curve. One Spark for yourself, two for a
comfortable sweet spot, four if you’re serving a crowd.

I understand you use Claude for the replies, but you could put some effort into it at least.

Actually, I did. Sorry to not have met your expectations.

I am using this model + MTP on 2 DGX Sparks with vLLM. I’ll attach my recipe, but will be trying some of the suggestions from here as well. I am launching with the sparkrun project and a locally built vllm-node image from the spark-vllm-docker project. When I return from camping next week I’ll figure out how to host the recipe in a proper repo.

run.sh.txt (795 Bytes)

recipe.yaml.txt (1.6 KB)

Edit: I forgot to mention I modified the sparkrun-vllm-docker/Dockerfile with

ARG CUDA_IMAGE=nvidia/cuda:13.3.0-devel-ubuntu24.04