Multi-node DGX Spark + SGLang win: Gemma-4-31B + MTP — +80 % @ n=8 (153 tok/s) on 4× GB10

ht12 · May 16, 2026, 1:57pm

Following up on the earlier RoCE / NVFP4 multi-node thread
(Two multi-node DGX Spark wins: RoCE, 2× inference throughput, Qwen3.5-397B-A17B-NVFP4 serving with SM121 CUTLASS patch)
— we now have Gemma-4-31B-it (dense, BF16) running on SGLang with the new
*-assistant MTP drafter, on a 4-node DGX Spark cluster (GB10 / SM121),
TP=4 EP=1, NCCL over RoCE.

Sharing because most success reports for Gemma-4 + MTP so far have been
vLLM-side (Hospedales, joshua.dale.warner, EUGR), and I haven’t seen the
SGLang side documented end-to-end yet.

TL;DR

Concurrency	Baseline (no MTP)	MTP `num_steps=4`	MTP `num_steps=6`	Best Δ vs baseline
n=1	10.49 tok/s	22.09	26.68 ★	+154 %
n=4	44.06 tok/s	86.24	91.41 ★	+108 %
n=8	85.34 tok/s	153.24 ★	146.11	+80 %

Drafter: google/gemma-4-31B-it-assistant (4-layer auxiliary checkpoint, Apache-2.0, released 2026-05-05)
Path: SGLang detects Gemma4AssistantForCausalLM as drafter and auto-promotes --speculative-algorithm NEXTN → FROZEN_KV_MTP (recurrent hidden-state draft loop, frozen target KV cache).
Acceptance: median accept_rate ≈ 0.52 at num_steps=4, median accept_len ≈ 3.05/4 (the drafter clears about 3/4 of the theoretical max per cycle).
Output quality: 62/62 successful requests across the MTP sweep finished on natural EOS (stop); no word-salad, no length cap hits.
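
The "Best Δ vs baseline" column can be recomputed from the raw throughput figures. A minimal sketch; the numbers are copied from the TL;DR table, while the dict layout and the function name are illustrative only:

```python
# Throughput in tok/s, copied from the TL;DR table (keys are concurrency n).
baseline = {1: 10.49, 4: 44.06, 8: 85.34}
mtp = {
    4: {1: 22.09, 4: 86.24, 8: 153.24},  # num_steps=4
    6: {1: 26.68, 4: 91.41, 8: 146.11},  # num_steps=6
}

def best_gain(n):
    """Return (num_steps, tok/s, percent gain) of the fastest MTP setting at concurrency n."""
    steps, tps = max(((s, row[n]) for s, row in mtp.items()), key=lambda p: p[1])
    return steps, tps, (tps / baseline[n] - 1) * 100

for n in sorted(baseline):
    steps, tps, gain = best_gain(n)
    print(f"n={n}: num_steps={steps}, {tps:.2f} tok/s, +{gain:.1f} % vs baseline")
```

The gain is simply best MTP throughput divided by baseline throughput, minus one.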

Cluster / image

Component	Value
GPU	4× NVIDIA GB10 (SM121 / Blackwell), 128 GB per node
Driver	580.142
CUDA	13.2 host / 13.0 image (SGLang PR #21498)
Kernel	6.19.13-custom (Ubuntu 24.04 LTS aarch64)
Interconnect	200 GbE QSFP, ConnectX-7, MikroTik CRS812, RoCE via SR-IOV VF
K3s	v1.35.3+k3s1
SGLang	v0.5.11 + PR #24436 cherry-pick
NCCL	2.29.7+cuda13.2 (dgxspark-3node-ring)
sgl-kernel	0.4.2 / FlashInfer 0.6.8.post1 / PyTorch 2.11
Image	`xomoxcc/dgx-spark-sglang:0.5.11-gemma4-sm121`
Model	`google/gemma-4-31B-it` (text-only, native 256K context)

KV cache fp8_e4m3, mem_fraction_static=0.60, context_length=262144.
Model is dense 30.7 B (not MoE) so no MoE-runner sweep — only
attention × cuda_graph × speculative variants.

What was needed to get this running on SGLang

1) PR #24436 must be in your image — the v0.5.11 tag does not have it.

SGLang’s stock NEXTN/EAGLE worker loads the drafter via
AutoModel.from_config(...) and then performs weight surgery on
model.language_model, an attribute that doesn’t exist on
Gemma4AssistantForCausalLM. Our first attempt on 2026-05-12 crashed during
model load with:

ValueError: No module or parameter named 'model.language_model'
            in TransformersMultiModalForCausalLM

A sitecustomize.py AutoModel-register stop-gap can paper over the
registration miss but not the weight-surgery path. The proper fix is
SGLang PR #24436 — “Gemma 4 — Adding MTP support”:
it adds a dedicated Gemma4AssistantForCausalLM model and the new
FROZEN_KV_MTP speculative algorithm. Merged 2026-05-07, after the v0.5.11
tag — so it has to be cherry-picked into the image.

2) Auto-promotion to FROZEN_KV_MTP is the intended path.

At runtime, with the cherry-pick in place, SGLang prints:

Detected Gemma4AssistantForCausalLM draft;
  promoting --speculative-algorithm NEXTN to FROZEN_KV_MTP
Overlap scheduler is disabled when using Frozen-KV MTP speculative decoding
  (spec v2 is not supported yet)
Capture Frozen-KV MTP draft cuda graph begin
Capture Frozen-KV MTP draft cuda graph end

You can leave speculative_algorithm: NEXTN and enable_spec_v2: true
in your config — both get overridden by the FROZEN_KV_MTP path.

3) Set speculative_num_draft_tokens manually — autoadjust doesn’t fire.

SGLang requires num_draft_tokens ≥ num_steps + 1: one draft token per
step, plus one slot for the final accepted token. The cookbook’s fixed
num_draft_tokens=6 therefore only matches num_steps=5. For any other value
you have to change it in lockstep; autoadjust didn’t kick in for us.
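
Since the two values have to move together, it can help to derive one from the other before launching. A minimal sketch of the constraint as described above; the helper names are hypothetical and not part of SGLang:

```python
def draft_tokens_for(num_steps: int) -> int:
    """Smallest num_draft_tokens that satisfies num_draft_tokens >= num_steps + 1."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps + 1

def check_spec_config(num_steps: int, num_draft_tokens: int) -> None:
    """Raise early if the pair would violate the constraint."""
    needed = draft_tokens_for(num_steps)
    if num_draft_tokens < needed:
        raise ValueError(
            f"num_steps={num_steps} needs num_draft_tokens >= {needed}, "
            f"got {num_draft_tokens}"
        )

# The sweep values used in this post: num_steps 2..6 paired with 3..7.
for steps in (2, 3, 4, 5, 6):
    print(f"num_steps={steps} -> num_draft_tokens={draft_tokens_for(steps)}")
```

Calling check_spec_config(6, 6) raises, which is the situation you end up in if you raise num_steps but keep the cookbook value.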

4) attention_backend: triton, not fi (FlashInfer).

FlashInfer prefill still crashes on Gemma-4 head_dim=256 + RoPE=64:

FlashInfer Internal Error: Invalid configuration :
  NUM_MMA_Q=1 NUM_MMA_D_QK=32 NUM_MMA_D_VO=32 NUM_MMA_KV=1
  NUM_WARPS_Q=1 NUM_WARPS_KV=4
  (prefill.cuh:2978)

Same dispatch-table miss as on 0.5.10; FlashInfer 0.6.8.post1 + sgl-kernel
0.4.2 didn’t close it. The assert fires at the first decode call, not at
graph capture — even eager mode hits it. Workaround: attention_backend: triton
(this is the profile default; just don’t override it).

Sweep results

speculative_num_steps ∈ {2, 3, 4, 5, 6}, drafter
google/gemma-4-31B-it-assistant, winner shape fixed to triton-attn +
CUDA graphs on + piecewise on:

`num_steps`	`num_draft_tokens`	n=1 tok/s	n=4 peak	n=8 peak	Acceptance (median)
2	3	20.83	77.67	(skipped)	~0.68 (accept_len 2.4/2)
3	4	22.88	83.04	142.02	~0.55 (2.7/3)
4	5	22.09	86.24	153.24 ★	~0.52 (3.05/4)
5	6	23.40	88.27	149.73	~0.50 (~3.0/5)
6	7	26.68 ★	91.41 ★	146.11	~0.48 (~3.1/6)

Pattern: per-step acceptance rate keeps dropping as you add steps
(0.68 → 0.55 → 0.52 → 0.50 → 0.48), but the absolute accepted-token
count per verify cycle keeps climbing (2.4 → 2.7 → 3.05 → …). Net
throughput wins until verify-batch overhead at high concurrency starts
eating the drafter gain — which happens for us past num_steps=4 at n=8.

Sweet spots:

n=1 / n=4 (single-stream, agent workloads): num_steps=6 — 26.7 / 91.4 tok/s (curve still climbing at 6, num_steps=7 might push further)
n=8 (concurrent serving): num_steps=4 — 153.24 tok/s, +80 % over the no-MTP baseline
Mixed deployment: 6→4 costs about 4.6 tok/s at n=1 and 5.2 tok/s at n=4, but buys about 5 % at n=8 (146.11 → 153.24 tok/s). Pick by your traffic shape.
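
To pick a setting for a given traffic shape, the sweep table can be queried directly. A minimal sketch; the throughput values are copied from the sweep table, and the function names are illustrative:

```python
# num_steps -> {concurrency n: tok/s}, copied from the sweep table.
# n=8 was skipped for num_steps=2, so that entry is absent.
sweep = {
    2: {1: 20.83, 4: 77.67},
    3: {1: 22.88, 4: 83.04, 8: 142.02},
    4: {1: 22.09, 4: 86.24, 8: 153.24},
    5: {1: 23.40, 4: 88.27, 8: 149.73},
    6: {1: 26.68, 4: 91.41, 8: 146.11},
}

def best_steps(n):
    """num_steps with the highest measured throughput at concurrency n."""
    candidates = {s: row[n] for s, row in sweep.items() if n in row}
    return max(candidates, key=candidates.get)

def switch_delta(n, frm, to):
    """tok/s change at concurrency n when moving from num_steps=frm to num_steps=to."""
    return sweep[to][n] - sweep[frm][n]

for n in (1, 4, 8):
    print(f"n={n}: best num_steps={best_steps(n)}, "
          f"6->4 changes throughput by {switch_delta(n, 6, 4):+.2f} tok/s")
```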

Production config (SGLang launch args / dgxarley profile)

attention_backend: triton                   # fi-attn crashes on head_dim=256+RoPE=64
disable_cuda_graph: false
disable_piecewise_cuda_graph: false
kv_cache_dtype: fp8_e4m3
mem_fraction_static: 0.60
context_length: 262144
nccl_transport: roce
cuda_graph_max_bs: 8

speculative_enabled: true
speculative_algorithm: NEXTN                # auto-promoted to FROZEN_KV_MTP at runtime
speculative_draft_model_path: google/gemma-4-31B-it-assistant
speculative_num_steps: 4                    # 153 tok/s at n=8 (+80% vs baseline)
speculative_num_draft_tokens: 5             # = num_steps + 1; autoadjust doesn't fire
speculative_eagle_topk: 1
enable_spec_v2: true                        # auto-disabled by FROZEN_KV_MTP path

For single-stream / agent traffic, switch to num_steps=6 /
num_draft_tokens=7 → 26.7 tok/s at n=1 (vs 22.1 at num_steps=4).
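
If you are not using the dgxarley profile, the same settings can be passed to SGLang as server arguments. A rough sketch, assuming each profile key below maps to the same-named `--kebab-case` flag; that mapping is an assumption, not something this post spells out. Profile-level keys such as nccl_transport, speculative_enabled and enable_spec_v2 are left out because it is not clear they correspond to a flag, and the multi-node arguments (node count, rank, init address) are omitted too:

```python
import shlex

# Subset of the production profile; values copied from the config above.
profile = {
    "attention_backend": "triton",
    "kv_cache_dtype": "fp8_e4m3",
    "mem_fraction_static": 0.60,
    "context_length": 262144,
    "cuda_graph_max_bs": 8,
    "speculative_algorithm": "NEXTN",
    "speculative_draft_model_path": "google/gemma-4-31B-it-assistant",
    "speculative_num_steps": 4,
    "speculative_num_draft_tokens": 5,
    "speculative_eagle_topk": 1,
}

def to_cli_args(cfg):
    """Turn snake_case keys into --kebab-case flags (assumed 1:1 mapping)."""
    args = []
    for key, value in cfg.items():
        args += ["--" + key.replace("_", "-"), str(value)]
    return args

cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "google/gemma-4-31B-it",
    "--tp-size", "4",
    *to_cli_args(profile),
]
print(shlex.join(cmd))
```

Check the printed command against the server arguments of the SGLang build you actually run before relying on it.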

Output quality

All 62 successful MTP requests checked:

0 finish=length (every response stopped on natural EOS within the 3072-token cap)
Output-token range 939 → 1730 (median ~1380)
Tail-eyeball + grep for triple-word repetitions / self-correction markers
(self-correct / stop rambling / thinking thinking / retire retire): clean.
Coherence profile identical to the no-MTP baseline runs.
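
The checks in this list can be approximated in a few lines. A minimal sketch, not the script used for this post: the regex is one possible reading of "triple-word repetition", the marker strings are the ones listed above, and the response dict shape (`text`, `finish_reason`) is hypothetical:

```python
import re

# Same word three or more times in a row, e.g. "thinking thinking thinking".
TRIPLE_WORD = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)

# Marker strings taken from the list above.
MARKERS = ("self-correct", "stop rambling", "thinking thinking", "retire retire")

def has_triple_repeat(text):
    return TRIPLE_WORD.search(text) is not None

def has_marker(text):
    lowered = text.lower()
    return any(marker in lowered for marker in MARKERS)

def summarize(responses):
    """Count length-cap hits, triple repeats and marker hits over all responses."""
    responses = list(responses)
    return {
        "finish_length": sum(r["finish_reason"] == "length" for r in responses),
        "triple_repeats": sum(has_triple_repeat(r["text"]) for r in responses),
        "marker_hits": sum(has_marker(r["text"]) for r in responses),
    }
```

A clean run is one where all three counters come back as zero.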

The verify path is lossless, as it should be — drafter only proposes,
target accepts/rejects.
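
Why the verify step cannot change the output is easiest to see in the greedy case. The following is a textbook-style sketch of greedy draft verification, not SGLang's FROZEN_KV_MTP code; `target_next_token` is a hypothetical stand-in for the target model's greedy choice, and a real engine scores all draft positions in one batched forward pass:

```python
def verify_greedy(draft_tokens, target_next_token):
    """Accept the longest draft prefix the target agrees with, then emit one target token.

    target_next_token(prefix) returns the token the target would pick
    after the tokens accepted so far in this cycle.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(accepted)
        if tok != expected:
            # First disagreement: the target's own token replaces the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched: the target still contributes one extra token.
    accepted.append(target_next_token(accepted))
    return accepted

# Toy target that would generate 5, 7, 9, 11, 13 on its own.
target_seq = [5, 7, 9, 11, 13]
target = lambda prefix: target_seq[len(prefix)]

print(verify_greedy([5, 7, 8, 11], target))   # third draft token is rejected
print(verify_greedy([5, 7, 9, 11], target))   # all four draft tokens accepted
```

In both cases the result is a prefix of what the target would have produced alone, so a weak drafter only costs speed. The extra token on full acceptance is also why four draft steps need five token slots.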

Where this sits vs the vLLM reports

For context, the published vLLM numbers on single Spark + NVFP4A16:

Hospedales: ~104 → ~175 tok/s peak with MTP, 2.34× wall-clock, 67–69 % acceptance
joshua.dale.warner: 30–40 tok/s “most workloads”, 1.2 M-token KV cache

Ours (SGLang, 4× Spark BF16, TP=4 EP=1): 85 → 153 tok/s @ n=8, 1.80×.
Lower peak than Hospedales’ single-Spark NVFP4A16 mostly because we’re
still on BF16 — Gemma-4-31B dense is bandwidth-bound, so the extra GPUs
mostly burn NCCL overhead instead of helping. Once SGLang Gemma-4 NVFP4
support (PR #22615, #22927, #22928, #22929) lands, we expect to close most
of that gap.
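
A back-of-envelope estimate illustrates the bandwidth argument. This is a rough sketch under stated assumptions: the 273 GB/s figure is the GB10 spec-sheet memory bandwidth and does not come from this post, FP4 is counted as exactly 0.5 bytes per parameter (scale factors ignored), and KV-cache reads, activations and all communication are ignored:

```python
PARAMS = 30.7e9   # dense parameter count, from this post
GB10_BW = 273e9   # bytes/s; assumed spec-sheet memory bandwidth per node

def decode_ceiling(bytes_per_param, tp):
    """Upper bound on single-stream tok/s if every node streams its weight shard once per token."""
    shard_bytes = PARAMS * bytes_per_param / tp
    return GB10_BW / shard_bytes

bf16_tp4 = decode_ceiling(2.0, 4)   # BF16 weights split across 4 nodes
fp4_tp1 = decode_ceiling(0.5, 1)    # 4-bit weights on a single node

print(f"BF16, TP=4: {bf16_tp4:.1f} tok/s ceiling per stream")
print(f"FP4,  TP=1: {fp4_tp1:.1f} tok/s ceiling per stream")
```

Under these assumptions both setups stream the same number of weight bytes per node per token, so four BF16 nodes start from the same ceiling as one 4-bit node and additionally pay for NCCL traffic.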

Links

Full test log (matrix of 11 cases, per-request tables, acceptance traces, all decode batch logs): TESTLOGS/sglang_nn4_tp4_ep1/gemma-4-31b-it/TESTLOG_nv580.142_sglang-0.5.11_gemma-4-31b-it_4n.md
Cluster setup / RoCE / SM121 patches: vroomfondel/dgxarley
SGLang PR #24436 (Gemma-4 MTP support / FROZEN_KV_MTP algorithm): [Gemma 4] Adding MTP support by kpham-sgl · Pull Request #24436 · sgl-project/sglang · GitHub
Drafter: google/gemma-4-31B-it-assistant
Earlier multi-node Spark post (RoCE / NVFP4 / Qwen3.5-397B): Two multi-node DGX Spark wins: RoCE 2× inference throughput + Qwen3.5-397B-A17B-NVFP4 serving (with SM121 CUTLASS patch)

Happy to share more detail on any of: the PR #24436 cherry-pick patch,
the FROZEN_KV_MTP runtime logs, the per-test acceptance distributions,
or the RoCE / SR-IOV / NAD setup the cluster runs on.
