Multi-node DGX Spark + SGLang win: Gemma-4-31B + MTP — +80 % @ n=8 (153 tok/s) on 4× GB10

Following up on the earlier RoCE / NVFP4 multi-node thread
(Two multi-node DGX Spark wins: RoCE, 2× inference throughput, Qwen3.5-397B-A17B-NVFP4 serving with SM121 CUTLASS patch)
— we now have Gemma-4-31B-it (dense, BF16) running on SGLang with the new
*-assistant MTP drafter, on a 4-node DGX Spark cluster (GB10 / SM121),
TP=4 EP=1, NCCL over RoCE.

Sharing because most success reports for Gemma-4 + MTP so far have been
vLLM-side (Hospedales, joshua.dale.warner, EUGR), and I haven’t seen the
SGLang side documented end-to-end yet.


TL;DR

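| Concurrency | Baseline (no MTP) | MTP num_steps=4 | MTP num_steps=6 | Best Δ vs baseline |
|---|---|---|---|---|
| n=1 | 10.49 tok/s | 22.09 | 26.68 ★ | +154 % |
| n=4 | 44.06 tok/s | 86.24 | 91.41 ★ | +107 % |
| n=8 | 85.34 tok/s | 153.24 ★ | 146.11 | +80 % |

(★ marks the best result per row.)
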
  • Drafter: google/gemma-4-31B-it-assistant (4-layer auxiliary checkpoint, Apache-2.0, released 2026-05-05)
  • Path: SGLang detects Gemma4AssistantForCausalLM as the drafter and auto-promotes --speculative-algorithm NEXTN to FROZEN_KV_MTP (recurrent hidden-state draft loop, frozen target KV cache).
  • Acceptance: median accept_rate ≈ 0.52 at num_steps=4, median accept_len ≈ 3.0/4 (the drafter reaches about three quarters of the theoretical maximum accepted length per cycle).
  • Output quality: 62/62 successful requests across the MTP sweep finished on natural EOS (stop); no word-salad, no length cap hits.
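
For anyone who wants to recompute the "Best Δ vs baseline" column, here is a minimal sketch. It uses only the throughput figures from the TL;DR table; the dictionary layout and the function name `best_gain` are illustrative and not part of any SGLang tooling.

```python
# Throughput figures (tok/s) copied from the TL;DR table.
baseline = {1: 10.49, 4: 44.06, 8: 85.34}
mtp = {
    4: {1: 22.09, 4: 86.24, 8: 153.24},  # num_steps=4
    6: {1: 26.68, 4: 91.41, 8: 146.11},  # num_steps=6
}


def best_gain(n: int) -> tuple[int, float]:
    """Return (best num_steps, percent gain over baseline) at concurrency n."""
    steps = max(mtp, key=lambda s: mtp[s][n])
    gain = (mtp[steps][n] / baseline[n] - 1.0) * 100.0
    return steps, gain


if __name__ == "__main__":
    for n in sorted(baseline):
        steps, gain = best_gain(n)
        print(f"n={n}: best num_steps={steps}, gain {gain:.1f} %")
```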

Cluster / image

| Component | Value |
|---|---|
| GPU | 4× NVIDIA GB10 (SM121 / Blackwell), 128 GB per node |
| Driver | 580.142 |
| CUDA | 13.2 host / 13.0 image (SGLang PR #21498) |
| Kernel | 6.19.13-custom (Ubuntu 24.04 LTS aarch64) |
| Interconnect | 200 GbE QSFP, ConnectX-7, MikroTik CRS812, RoCE via SR-IOV VF |
| K3s | v1.35.3+k3s1 |
| SGLang | v0.5.11 + PR #24436 cherry-pick |
| NCCL | 2.29.7+cuda13.2 (dgxspark-3node-ring) |
| sgl-kernel / FlashInfer / PyTorch | 0.4.2 / 0.6.8.post1 / 2.11 |
| Image | xomoxcc/dgx-spark-sglang:0.5.11-gemma4-sm121 |
| Model | google/gemma-4-31B-it (text-only, native 256K context) |

KV cache fp8_e4m3, mem_fraction_static=0.60, context_length=262144.
Model is dense 30.7 B (not MoE) so no MoE-runner sweep — only
attention × cuda_graph × speculative variants.


What was needed to get this running on SGLang

1) PR #24436 must be in your image — the v0.5.11 tag does not have it.

SGLang’s stock NEXTN/EAGLE worker loads the drafter via
AutoModel.from_config(...) and then does a model.language_model
weight surgery that doesn’t exist on Gemma4AssistantForCausalLM. First
attempt on 2026-05-12 crashed during model load with:

ValueError: No module or parameter named 'model.language_model'
            in TransformersMultiModalForCausalLM

A sitecustomize.py AutoModel-register stop-gap can paper over the
registration miss but not the weight-surgery path. The proper fix is
SGLang PR #24436 — “Gemma 4 — Adding MTP support”:
it adds a dedicated Gemma4AssistantForCausalLM model and the new
FROZEN_KV_MTP speculative algorithm. Merged 2026-05-07, after the v0.5.11
tag — so it has to be cherry-picked into the image.

2) Auto-promotion to FROZEN_KV_MTP is the intended path.

At runtime, with the cherry-pick in place, SGLang prints:

Detected Gemma4AssistantForCausalLM draft;
  promoting --speculative-algorithm NEXTN to FROZEN_KV_MTP
Overlap scheduler is disabled when using Frozen-KV MTP speculative decoding
  (spec v2 is not supported yet)
Capture Frozen-KV MTP draft cuda graph begin
Capture Frozen-KV MTP draft cuda graph end

You can leave speculative_algorithm: NEXTN and enable_spec_v2: true
in your config — both get overridden by the FROZEN_KV_MTP path.

3) Set speculative_num_draft_tokens manually — autoadjust doesn’t fire.

SGLang requires num_draft_tokens ≥ num_steps + 1 (each step contributes
one draft token plus the final accepted-token slot). The cookbook’s fixed
num_draft_tokens=6 only matches num_steps=5. For any other value you
have to bump it in lockstep — autoadjust didn’t kick in for us.
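
A minimal sketch of the lockstep rule, for keeping the two values consistent when generating configs. The helper names are hypothetical (they are not SGLang functions), and only the topk=1 chain case used in this post is covered.

```python
def draft_tokens_for(num_steps: int, topk: int = 1) -> int:
    """Smallest num_draft_tokens that satisfies num_draft_tokens >= num_steps + 1.

    Hypothetical helper: only the topk=1 chain case from this post is handled.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if topk != 1:
        raise NotImplementedError("only speculative_eagle_topk=1 is covered here")
    return num_steps + 1


def check_pair(num_steps: int, num_draft_tokens: int) -> None:
    """Raise if the pair violates the constraint described above."""
    needed = draft_tokens_for(num_steps)
    if num_draft_tokens < needed:
        raise ValueError(
            f"num_draft_tokens={num_draft_tokens} is too small for "
            f"num_steps={num_steps}; need at least {needed}"
        )


if __name__ == "__main__":
    for steps in (2, 3, 4, 5, 6):
        print(steps, draft_tokens_for(steps))
```

Calling `check_pair(6, 6)` raises, which is the situation you hit when raising num_steps to 6 while keeping the cookbook's num_draft_tokens=6.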

4) attention_backend: triton, not flashinfer.

FlashInfer prefill still crashes on Gemma-4 head_dim=256 + RoPE=64:

FlashInfer Internal Error: Invalid configuration :
  NUM_MMA_Q=1 NUM_MMA_D_QK=32 NUM_MMA_D_VO=32 NUM_MMA_KV=1
  NUM_WARPS_Q=1 NUM_WARPS_KV=4
  (prefill.cuh:2978)

Same dispatch-table miss as on 0.5.10; FlashInfer 0.6.8.post1 + sgl-kernel
0.4.2 didn’t close it. The assert fires at the first decode call, not at
graph capture — even eager mode hits it. Workaround: attention_backend: triton
(this is the profile default; just don’t override it).


Sweep results

speculative_num_steps ∈ {2, 3, 4, 5, 6}, drafter
google/gemma-4-31B-it-assistant, winner shape fixed to triton-attn +
CUDA graphs on + piecewise on:

| num_steps | num_draft_tokens | n=1 tok/s | n=4 peak | n=8 peak | Acceptance (median) |
|---|---|---|---|---|---|
| 2 | 3 | 20.83 | 77.67 | (skipped) | ~0.68 (accept_len 2.4/2) |
| 3 | 4 | 22.88 | 83.04 | 142.02 | ~0.55 (2.7/3) |
| 4 | 5 | 22.09 | 86.24 | 153.24 ★ | ~0.52 (3.05/4) |
| 5 | 6 | 23.40 | 88.27 | 149.73 | ~0.50 (~3.0/5) |
| 6 | 7 | 26.68 ★ | 91.41 ★ | 146.11 | ~0.48 (~3.1/6) |

Pattern: per-step acceptance rate keeps dropping as you add steps
(0.68 → 0.55 → 0.52 → 0.50 → 0.48), but the absolute accepted-token
count per verify cycle keeps climbing (2.4 → 2.7 → 3.05 → …). Net
throughput wins until verify-batch overhead at high concurrency starts
eating the drafter gain — which happens for us past num_steps=4 at n=8.

Sweet spots:

  • n=1 / n=4 (single-stream, agent workloads): num_steps=6 — 26.7 / 91.4 tok/s (curve still climbing at 6, num_steps=7 might push further)
  • n=8 (concurrent serving): num_steps=4 — 153.24 tok/s, +80 % over the no-MTP baseline
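  • Mixed deployment: going from num_steps=6 to 4 costs about 5 tok/s at n=1 and n=4 (26.68 → 22.09, 91.41 → 86.24) but buys about 7 tok/s, roughly 5 %, at n=8 (146.11 → 153.24). Pick by your traffic shape.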
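
One way to make "pick by your traffic shape" concrete is to weight the sweep results by how much of your traffic runs at each concurrency level. The sketch below is an illustration only: the scoring rule (throughput normalized to the best measured value per level, then weighted by traffic share) and the function name `pick_num_steps` are assumptions, not something the sweep itself prescribes. The numbers come from the sweep table.

```python
# tok/s from the sweep table: num_steps -> {concurrency: throughput}.
# None marks the run that was skipped.
SWEEP = {
    2: {1: 20.83, 4: 77.67, 8: None},
    3: {1: 22.88, 4: 83.04, 8: 142.02},
    4: {1: 22.09, 4: 86.24, 8: 153.24},
    5: {1: 23.40, 4: 88.27, 8: 149.73},
    6: {1: 26.68, 4: 91.41, 8: 146.11},
}


def pick_num_steps(traffic_share: dict[int, float]) -> int:
    """Pick num_steps for a traffic mix given as concurrency -> share of traffic."""
    levels = [n for n, share in traffic_share.items() if share > 0]
    # Best measured throughput per concurrency level, used for normalization
    # so that the large n=8 numbers do not dominate the score by magnitude.
    best = {
        n: max(row[n] for row in SWEEP.values() if row[n] is not None)
        for n in levels
    }
    scores = {}
    for steps, row in SWEEP.items():
        if any(row[n] is None for n in levels):
            continue  # no measurement for a level we care about
        scores[steps] = sum(traffic_share[n] * row[n] / best[n] for n in levels)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(pick_num_steps({1: 1.0}))          # pure single-stream traffic
    print(pick_num_steps({4: 0.2, 8: 0.8}))  # mostly concurrent serving
```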

Production config (SGLang launch args / dgxarley profile)

attention_backend: triton                   # fi-attn crashes on head_dim=256+RoPE=64
disable_cuda_graph: false
disable_piecewise_cuda_graph: false
kv_cache_dtype: fp8_e4m3
mem_fraction_static: 0.60
context_length: 262144
nccl_transport: roce
cuda_graph_max_bs: 8

speculative_enabled: true
speculative_algorithm: NEXTN                # auto-promoted to FROZEN_KV_MTP at runtime
speculative_draft_model_path: google/gemma-4-31B-it-assistant
speculative_num_steps: 4                    # 153 tok/s at n=8 (+80% vs baseline)
speculative_num_draft_tokens: 5             # = num_steps + 1; autoadjust doesn't fire
speculative_eagle_topk: 1
enable_spec_v2: true                        # auto-disabled by FROZEN_KV_MTP path

For single-stream / agent traffic, switch to num_steps=6 /
num_draft_tokens=7 → 26.7 tok/s at n=1 (vs 22.1 at num_steps=4).
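
If you are not using the dgxarley profile, the sketch below shows one way the same settings might be turned into `sglang.launch_server` arguments. Treat it as a hedged illustration: the key-to-flag mapping (underscores to hyphens) is an assumption that should be checked against `python3 -m sglang.launch_server --help` for your build, profile-only keys such as nccl_transport, speculative_enabled, enable_spec_v2 and the disable_* toggles are deliberately left out because the post does not say how the profile translates them, and the multi-node arguments (node count, rank, init address) are omitted.

```python
import shlex

# Subset of the production profile that is assumed to map one-to-one
# onto SGLang server flags.
PROFILE = {
    "attention_backend": "triton",
    "kv_cache_dtype": "fp8_e4m3",
    "mem_fraction_static": 0.60,
    "context_length": 262144,
    "cuda_graph_max_bs": 8,
    "speculative_algorithm": "NEXTN",
    "speculative_draft_model_path": "google/gemma-4-31B-it-assistant",
    "speculative_num_steps": 4,
    "speculative_num_draft_tokens": 5,
    "speculative_eagle_topk": 1,
}


def to_argv(profile: dict, model_path: str, tp_size: int) -> list[str]:
    """Build an argv list; flag names are derived by replacing '_' with '-'."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
    ]
    for key, value in profile.items():
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


if __name__ == "__main__":
    print(shlex.join(to_argv(PROFILE, "google/gemma-4-31B-it", 4)))
```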


Output quality

All 62 successful MTP requests checked:

  • 0 finish=length (every response stopped on natural EOS within the 3072-token cap)
  • Output-token range 939 → 1730 (median ~1380)
  • Tail-eyeball + grep for triple-word repetitions / self-correction markers
    (self-correct / stop rambling / thinking thinking / retire retire): clean.
  • Coherence profile identical to the no-MTP baseline runs.
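
The repetition grep can be approximated in a few lines of Python. This is a rough sketch of that kind of check, not the exact script used for the sweep: the regex, the marker list (taken from the bullet above) and the function name `looks_degenerate` are illustrative.

```python
import re

# Same word three or more times in a row, e.g. "thinking thinking thinking".
TRIPLE_WORD = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)

# Marker phrases listed in the quality check above.
MARKERS = ("self-correct", "stop rambling", "thinking thinking", "retire retire")


def looks_degenerate(text: str) -> bool:
    """Flag a response that shows word-level repetition or a known marker."""
    lowered = text.lower()
    if TRIPLE_WORD.search(text):
        return True
    return any(marker in lowered for marker in MARKERS)


if __name__ == "__main__":
    samples = [
        "The model stopped cleanly at the end of its answer.",
        "so we retire retire the old path",
        "go go go",
    ]
    for sample in samples:
        print(looks_degenerate(sample), "|", sample)
```

A check like this only catches crude failure modes; it complements the tail-eyeball pass and does not replace it.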

The verify path is lossless, as it should be — drafter only proposes,
target accepts/rejects.
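
To illustrate why verification is lossless, here is a toy sketch of greedy draft verification. It is a simplified model of speculative decoding in general, not SGLang's FROZEN_KV_MTP code: it assumes greedy decoding (sampling needs a rejection-sampling rule instead), and the function name and token IDs are made up.

```python
def verify_greedy(draft: list[int], target_next: list[int]) -> list[int]:
    """Return the tokens emitted in one verify cycle.

    draft:       k tokens proposed by the drafter.
    target_next: k + 1 tokens the target picks greedily at each position,
                 computed in a single verify pass over the draft.
    The result is the longest draft prefix the target agrees with, plus one
    token chosen by the target itself, so the output always equals what the
    target would have produced without a drafter.
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have exactly one more entry than draft")
    emitted = []
    for i, token in enumerate(draft):
        if token != target_next[i]:
            break  # first disagreement: discard the rest of the draft
        emitted.append(token)
    # The target's own choice at the first unverified position.
    emitted.append(target_next[len(emitted)])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9, 2], [5, 7, 1, 8, 3]))
    print(verify_greedy([5, 7], [5, 7, 4]))
```

In the first call the drafter is right twice and wrong on the third token, so the cycle emits the two accepted tokens plus the target's correction. A bad drafter costs speed, never correctness.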


Where this sits vs the vLLM reports

For context, the published vLLM numbers on single Spark + NVFP4A16:

  • Hospedales: ~104 → ~175 tok/s peak with MTP, 2.34× wall-clock, 67–69 % acceptance
  • joshua.dale.warner: 30–40 tok/s “most workloads”, 1.2 M-token KV cache

Ours (SGLang, 4× Spark BF16, TP=4 EP=1): 85 → 153 tok/s @ n=8, 1.80×.
Lower peak than Hospedales’ single-Spark NVFP4A16 mostly because we’re
still on BF16 — Gemma-4-31B dense is bandwidth-bound, so the extra GPUs
mostly burn NCCL overhead instead of helping. Once SGLang Gemma-4 NVFP4
support (PR #22615, #22927, #22928, #22929) lands, we expect to close
most of that gap.


Links

Happy to share more detail on any of: the PR #24436 cherry-pick patch,
the FROZEN_KV_MTP runtime logs, the per-test acceptance distributions,
or the RoCE / SR-IOV / NAD setup the cluster runs on.
