Following up on the earlier RoCE / NVFP4 multi-node thread
(Two multi-node DGX Spark wins: RoCE, 2× inference throughput, Qwen3.5-397B-A17B-NVFP4 serving with SM121 CUTLASS patch)
— we now have Gemma-4-31B-it (dense, BF16) running on SGLang with the new
*-assistant MTP drafter, on a 4-node DGX Spark cluster (GB10 / SM121),
TP=4 EP=1, NCCL over RoCE.
Sharing because most success reports for Gemma-4 + MTP so far have been
vLLM-side (Hospedales, joshua.dale.warner, EUGR),
and I haven’t seen the SGLang side documented end-to-end yet.
TL;DR
| Concurrency | Baseline (no MTP) | MTP num_steps=4 | MTP num_steps=6 | Best Δ vs baseline |
|---|---|---|---|---|
| n=1 | 10.49 tok/s | 22.09 | 26.68 ★ | +154 % |
| n=4 | 44.06 tok/s | 86.24 | 91.41 ★ | +108 % |
| n=8 | 85.34 tok/s | 153.24 ★ | 146.11 | +80 % |
- Drafter: google/gemma-4-31B-it-assistant (4-layer auxiliary checkpoint, Apache-2.0, released 2026-05-05)
- Path: SGLang detects Gemma4AssistantForCausalLM as drafter and auto-promotes --speculative-algorithm NEXTN → FROZEN_KV_MTP (recurrent hidden-state draft loop, frozen target KV cache).
- Acceptance: median accept_rate ≈ 0.55 at num_steps=4, median accept_len ≈ 3.0/4 (drafter clears 3/4 of theoretical max per cycle).
- Output quality: 62/62 successful requests across the MTP sweep finished on natural EOS (stop); no word-salad, no length cap hits.
Cluster / image
| Component | Value |
|---|---|
| GPU | 4× NVIDIA GB10 (SM121 / Blackwell), 128 GB per node |
| Driver | 580.142 |
| CUDA | 13.2 host / 13.0 image (SGLang PR #21498) |
| Kernel | 6.19.13-custom (Ubuntu 24.04 LTS aarch64) |
| Interconnect | 200 GbE QSFP, ConnectX-7, MikroTik CRS812, RoCE via SR-IOV VF |
| K3s | v1.35.3+k3s1 |
| SGLang | v0.5.11 + PR #24436 cherry-pick |
| NCCL | 2.29.7+cuda13.2 (dgxspark-3node-ring) |
| sgl-kernel | 0.4.2 / FlashInfer 0.6.8.post1 / PyTorch 2.11 |
| Image | xomoxcc/dgx-spark-sglang:0.5.11-gemma4-sm121 |
| Model | google/gemma-4-31B-it (text-only, native 256K context) |
KV cache fp8_e4m3, mem_fraction_static=0.60, context_length=262144.
Model is dense 30.7 B (not MoE) so no MoE-runner sweep — only
attention × cuda_graph × speculative variants.
What was needed to get this running on SGLang
1) PR #24436 must be in your image — the v0.5.11 tag does not have it.
SGLang’s stock NEXTN/EAGLE worker loads the drafter via
AutoModel.from_config(...) and then performs weight surgery on
model.language_model, a submodule that doesn’t exist on
Gemma4AssistantForCausalLM. Our first attempt on 2026-05-12 crashed during
model load with:
ValueError: No module or parameter named 'model.language_model'
in TransformersMultiModalForCausalLM
A sitecustomize.py AutoModel-register stop-gap can paper over the
registration miss but not the weight-surgery path. The proper fix is
SGLang PR #24436 — “Gemma 4 — Adding MTP support”:
it adds a dedicated Gemma4AssistantForCausalLM model and the new
FROZEN_KV_MTP speculative algorithm. Merged 2026-05-07, after the v0.5.11
tag — so it has to be cherry-picked into the image.
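One quick way to check whether an image already carries the cherry-pick is to search the installed package source for the new class name. This is only a rough sketch: it assumes the package is importable as sglang and that the string Gemma4AssistantForCausalLM appears literally in one of its .py files. The helper name package_mentions is made up for illustration and is not part of SGLang.

```python
import importlib.util
from pathlib import Path


def package_mentions(package: str, needle: str) -> bool:
    """Return True if any .py file of an installed package contains `needle`."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return False
    if spec is None:
        return False  # package is not installed in this environment
    if spec.submodule_search_locations:
        roots = [Path(p) for p in spec.submodule_search_locations]
    elif spec.origin:
        roots = [Path(spec.origin)]
    else:
        return False
    for root in roots:
        files = root.rglob("*.py") if root.is_dir() else [root]
        for path in files:
            try:
                if needle in path.read_text(encoding="utf-8", errors="ignore"):
                    return True
            except OSError:
                continue  # unreadable or non-file origin, skip it
    return False


if __name__ == "__main__":
    # Assumption: the class name from PR #24436 shows up verbatim in the source.
    print(package_mentions("sglang", "Gemma4AssistantForCausalLM"))
```

A False result inside the container suggests the image was built from the plain v0.5.11 tag; a True result is only a hint, not proof that the whole PR applied cleanly.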
2) Auto-promotion to FROZEN_KV_MTP is the intended path.
At runtime, with the cherry-pick in place, SGLang prints:
Detected Gemma4AssistantForCausalLM draft;
promoting --speculative-algorithm NEXTN to FROZEN_KV_MTP
Overlap scheduler is disabled when using Frozen-KV MTP speculative decoding
(spec v2 is not supported yet)
Capture Frozen-KV MTP draft cuda graph begin
Capture Frozen-KV MTP draft cuda graph end
You can leave speculative_algorithm: NEXTN and enable_spec_v2: true
in your config — both get overridden by the FROZEN_KV_MTP path.
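If you want a smoke test to confirm that the promotion actually happened, scanning the server log for the messages quoted above is usually enough. A minimal sketch, assuming the log wording matches what we saw on our build (it may change in later SGLang versions); the function name frozen_kv_mtp_active is hypothetical.

```python
import re

# Wording taken from the log excerpt in this post; treat it as build-specific.
PROMOTION = re.compile(
    r"promoting\s+--speculative-algorithm\s+NEXTN\s+to\s+FROZEN_KV_MTP"
)
GRAPH_END = "Capture Frozen-KV MTP draft cuda graph end"


def frozen_kv_mtp_active(log_text: str) -> bool:
    """True if the log shows both the promotion and the finished draft graph capture."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return bool(PROMOTION.search(flat)) and GRAPH_END in flat
```

Feeding it the full startup log of one worker (for example the output of kubectl logs) is enough; it does not need per-request lines.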
3) Set speculative_num_draft_tokens manually — autoadjust doesn’t fire.
SGLang requires num_draft_tokens ≥ num_steps + 1 (each step contributes
one draft token plus the final accepted-token slot). The cookbook’s fixed
num_draft_tokens=6 only matches num_steps=5. For any other value you
have to bump it in lockstep — autoadjust didn’t kick in for us.
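Since the two values have to move together, it can help to derive one from the other instead of editing both by hand. A small sketch of the rule described above; the helper names are hypothetical and this runs outside SGLang, before launch.

```python
def min_draft_tokens(num_steps: int) -> int:
    """Smallest num_draft_tokens allowed for a given num_steps (num_steps + 1)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps + 1


def check_spec_config(num_steps: int, num_draft_tokens: int) -> None:
    """Raise if the pair violates num_draft_tokens >= num_steps + 1."""
    required = min_draft_tokens(num_steps)
    if num_draft_tokens < required:
        raise ValueError(
            f"num_steps={num_steps} needs num_draft_tokens >= {required}, "
            f"got {num_draft_tokens}"
        )


# The cookbook default of 6 draft tokens fits num_steps=5 but not num_steps=6.
check_spec_config(num_steps=5, num_draft_tokens=6)
```

Calling check_spec_config(6, 6) raises, which is the case the fixed cookbook value would hit.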
4) attention_backend: triton, not fi.
FlashInfer prefill still crashes on Gemma-4 head_dim=256 + RoPE=64:
FlashInfer Internal Error: Invalid configuration :
NUM_MMA_Q=1 NUM_MMA_D_QK=32 NUM_MMA_D_VO=32 NUM_MMA_KV=1
NUM_WARPS_Q=1 NUM_WARPS_KV=4
(prefill.cuh:2978)
Same dispatch-table miss as on 0.5.10; FlashInfer 0.6.8.post1 + sgl-kernel
0.4.2 didn’t close it. The assert fires at the first decode call, not at
graph capture — even eager mode hits it. Workaround: attention_backend: triton
(this is the profile default; just don’t override it).
Sweep results
speculative_num_steps ∈ {2, 3, 4, 5, 6}, drafter
google/gemma-4-31B-it-assistant, winner shape fixed to triton-attn +
CUDA graphs on + piecewise on:
| num_steps | num_draft_tokens | n=1 tok/s | n=4 peak | n=8 peak | Acceptance (median) |
|---|---|---|---|---|---|
| 2 | 3 | 20.83 | 77.67 | (skipped) | ~0.68 (accept_len 2.4/2) |
| 3 | 4 | 22.88 | 83.04 | 142.02 | ~0.55 (2.7/3) |
| 4 | 5 | 22.09 | 86.24 | 153.24 ★ | ~0.52 (3.05/4) |
| 5 | 6 | 23.40 | 88.27 | 149.73 | ~0.50 (~3.0/5) |
| 6 | 7 | 26.68 ★ | 91.41 ★ | 146.11 | ~0.48 (~3.1/6) |
Pattern: per-step acceptance rate keeps dropping as you add steps
(0.68 → 0.55 → 0.52 → 0.50 → 0.48), but the absolute accepted-token
count per verify cycle keeps climbing (2.4 → 2.7 → 3.05 → …). Net
throughput wins until verify-batch overhead at high concurrency starts
eating the drafter gain — which happens for us past num_steps=4 at n=8.
Sweet spots:
- n=1 / n=4 (single-stream, agent workloads): num_steps=6, at 26.7 / 91.4 tok/s (curve still climbing at 6; num_steps=7 might push further)
- n=8 (concurrent serving): num_steps=4, at 153.24 tok/s, +80 % over the no-MTP baseline
- Mixed deployment: going from 6 to 4 costs about 5 tok/s at n=1 and n=4 (26.68 → 22.09, 91.41 → 86.24) but buys about 5 % at n=8 (146.11 → 153.24). Pick by your traffic shape.
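To redo the sweet-spot pick for a different traffic mix, the sweep table is small enough to keep as plain data. A sketch using the throughput numbers from the table above; the names SWEEP, best_num_steps and delta are just for illustration.

```python
# tok/s from the sweep table: SWEEP[num_steps][concurrency]; None = skipped run.
SWEEP = {
    2: {1: 20.83, 4: 77.67, 8: None},
    3: {1: 22.88, 4: 83.04, 8: 142.02},
    4: {1: 22.09, 4: 86.24, 8: 153.24},
    5: {1: 23.40, 4: 88.27, 8: 149.73},
    6: {1: 26.68, 4: 91.41, 8: 146.11},
}


def best_num_steps(concurrency: int) -> int:
    """num_steps with the highest measured tok/s at this concurrency."""
    measured = {
        steps: row[concurrency]
        for steps, row in SWEEP.items()
        if row[concurrency] is not None
    }
    return max(measured, key=measured.get)


def delta(concurrency: int, from_steps: int, to_steps: int) -> float:
    """Change in tok/s when switching num_steps at a fixed concurrency."""
    return SWEEP[to_steps][concurrency] - SWEEP[from_steps][concurrency]


for n in (1, 4, 8):
    print(f"n={n}: best num_steps={best_num_steps(n)}")
```

delta(8, 6, 4) and delta(1, 6, 4) give the two sides of the mixed-deployment trade-off directly from the measured values.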
Production config (SGLang launch args / dgxarley profile)
attention_backend: triton # fi-attn crashes on head_dim=256+RoPE=64
disable_cuda_graph: false
disable_piecewise_cuda_graph: false
kv_cache_dtype: fp8_e4m3
mem_fraction_static: 0.60
context_length: 262144
nccl_transport: roce
cuda_graph_max_bs: 8
speculative_enabled: true
speculative_algorithm: NEXTN # auto-promoted to FROZEN_KV_MTP at runtime
speculative_draft_model_path: google/gemma-4-31B-it-assistant
speculative_num_steps: 4 # 153 tok/s at n=8 (+80% vs baseline)
speculative_num_draft_tokens: 5 # = num_steps + 1; autoadjust doesn't fire
speculative_eagle_topk: 1
enable_spec_v2: true # auto-disabled by FROZEN_KV_MTP path
For single-stream / agent traffic, switch to num_steps=6 /
num_draft_tokens=7 → 26.7 tok/s n=1 (vs 22.1 at num_steps=4).
Output quality
All 62 successful MTP requests checked:
- 0 finish=length (every response stopped on natural EOS within the 3072-token cap)
- Output-token range 939 → 1730 (median ~1380)
- Tail-eyeball + grep for triple-word repetitions / self-correction markers (self-correct/stop rambling/thinking thinking/retire retire): clean.
- Coherence profile identical to the no-MTP baseline runs.
The verify path is lossless, as it should be — drafter only proposes,
target accepts/rejects.
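The grep-style check is easy to reproduce. Below is a rough Python equivalent, not the exact script we ran: it assumes "triple-word repetition" means the same word three times in a row, and that each response record is a dict with a finish_reason field as in OpenAI-compatible responses. Both are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

# Marker phrases listed in this post.
MARKERS = ("self-correct", "stop rambling", "thinking thinking", "retire retire")

# Assumption: a "triple-word repetition" is one word repeated three times in a row.
TRIPLE = re.compile(r"\b(\w+)(?:\s+\1){2}\b", re.IGNORECASE)


def degeneration_hits(text: str) -> list[str]:
    """Return marker phrases and triple repetitions found in one response."""
    lowered = text.lower()
    hits = [marker for marker in MARKERS if marker in lowered]
    hits += [match.group(0) for match in TRIPLE.finditer(text)]
    return hits


def finish_reasons(responses: list[dict]) -> Counter:
    """Count finish_reason values, e.g. to confirm there are no 'length' stops."""
    return Counter(r["finish_reason"] for r in responses)
```

An empty list from degeneration_hits for every response, plus finish_reasons(...)["length"] == 0, corresponds to the "clean" result reported above.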
Where this sits vs the vLLM reports
For context, the published vLLM numbers on single Spark + NVFP4A16:
- Hospedales: ~104 → ~175 tok/s peak with MTP, 2.34× wall-clock, 67–69 % acceptance
- joshua.dale.warner: 30–40 tok/s “most workloads”, 1.2 M-token KV cache
Ours (SGLang, 4× Spark BF16, TP=4 EP=1): 85 → 153 tok/s @ n=8, 1.80×.
Lower peak than Hospedales’ single-Spark NVFP4A16 mostly because we’re
still on BF16 — Gemma-4-31B dense is bandwidth-bound, so the extra GPUs
mostly burn NCCL overhead instead of helping. Once SGLang Gemma-4 NVFP4 support
(PR #22615, #22927, #22928, #22929) lands, we expect to close most of that gap.
Links
- Full test log (matrix of 11 cases, per-request tables, acceptance traces, all decode batch logs): TESTLOGS/sglang_nn4_tp4_ep1/gemma-4-31b-it/TESTLOG_nv580.142_sglang-0.5.11_gemma-4-31b-it_4n.md
- Cluster setup / RoCE / SM121 patches: vroomfondel/dgxarley
- SGLang PR #24436 (Gemma-4 MTP support / FROZEN_KV_MTP algorithm): [Gemma 4] Adding MTP support by kpham-sgl · Pull Request #24436 · sgl-project/sglang · GitHub
- Drafter: google/gemma-4-31B-it-assistant
- Earlier multi-node Spark post (RoCE / NVFP4 / Qwen3.5-397B): Two multi-node DGX Spark wins: RoCE 2× inference throughput + Qwen3.5-397B-A17B-NVFP4 serving (with SM121 CUTLASS patch)
Happy to share more detail on any of: the PR #24436 cherry-pick patch,
the FROZEN_KV_MTP runtime logs, the per-test acceptance distributions,
or the RoCE / SR-IOV / NAD setup the cluster runs on.