Why 200 tok/s Is the New Normal — TP=2 Does Scale After All

The story so far … a hardware-limited little Spark: compute-bound, network-bound, memory-bound. Pitched by NVIDIA themselves as “better buy two.” Lots of enthusiasts in the forums. The dead horse “NVFP4” praised as the big hit. Two nodes don’t make anything faster, they’re only good for running larger models? A fresh $150M for the much-praised LLM framework vLLM… And somehow, nothing worked as it was supposed to.

Time to dig in ourselves? Deep into the rabbit hole? That deep?

I found everything needed to push the token rate to just 100 tok/s (zero context, long token generation, single DGX). Nearly all of it was already implemented, but nobody had put it together. Arch-family patches here, wrong environment variables there. Sounds absurd. Was absurd. For a product that was marketed so well.

Today the next narrative is flipped: TP=2 does scale after all. We reach 200 tok/s through better orchestration of what is already there. It shows that the many contributors already gave their best, but nobody picked it up. Not the device manufacturer. Not the well-funded frameworks. But how?

You may want to read [https://forums.developer.nvidia.com/t/why-273-gb-s-less-is-more-until-it-isn-t/] first.

Breakthrough 1: Multi-Node TP via torchrun

Problem

GB10 has 1 GPU per node and 273 GB/s of memory bandwidth. With Qwen3-Coder INT4, that’s enough for ~97 tok/s (TP=1). A single GPU can’t go higher.
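As a sanity check, the ceiling follows from memory bandwidth alone: each generated token streams the active weights from memory once, so tok/s ≈ bandwidth / bytes read per token. A back-of-envelope sketch (the ~2.8 GB/token figure is implied by the measured 97 tok/s, not stated anywhere in this post):

```python
# Back-of-envelope: a memory-bandwidth-bound decoder reads the active
# weights once per token, so tok/s ~= bandwidth / bytes_per_token.
bandwidth_gb_s = 273.0          # GB10 memory bandwidth (GB/s)
tok_s_measured = 97.0           # observed TP=1 rate

# Effective bytes streamed per token, implied by the measurement:
gb_per_token = bandwidth_gb_s / tok_s_measured
print(f"~{gb_per_token:.2f} GB read per token")

# TP=2 shards the weights, halving the per-GPU read per token,
# which is why ~2x tok/s is the theoretical target.
tok_s_tp2_ideal = 2 * tok_s_measured
print(f"ideal TP=2: {tok_s_tp2_ideal:.0f} tok/s")
```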

Solution

Connect two GB10s (DGX Spark + PGX ThinkStation) via QSFP56 RoCE at 200 Gbps, and serve with serve_torchrun.py — a custom server that uses PyTorch’s torchrun launcher instead of Ray.

Architecture

DGX Spark (Rank 0)                    PGX ThinkStation (Rank 1)
┌─────────────────────┐                ┌─────────────────────┐
│  HTTP API (:8011)   │                │                     │
│  ↓ Requests         │                │                     │
│  FastAPI Thread     │                │                     │
│  ↓                  │   GLOO (CPU)   │                     │
│  broadcast(request) │ ──────────────►│  receive(request)   │
│  ↓                  │                │  ↓                  │
│  engine.step()      │   NCCL (GPU)   │  engine.step()      │
│  half GEMMs ───────►│ ◄─AllReduce──► │◄── half GEMMs      │
│  ↓                  │   RoCE 200Gbps │  ↓                  │
│  Response → Client  │                │  (discarded)        │
└─────────────────────┘                └─────────────────────┘

Two Communication Layers

Layer           Protocol           Purpose                  Data
Control Plane   GLOO (CPU, TCP)    Request distribution     JSON (~1 KB)
Data Plane      NCCL (GPU, RoCE)   AllReduce partial sums   4 KiB BF16 × 97/token
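One detail worth making explicit: at 4 KiB per call, the data plane is latency-bound, not bandwidth-bound. A quick sketch (link speed and payload size are from the table above; the wire-time arithmetic is mine):

```python
# Per-token NCCL traffic: 97 AllReduce calls x 4 KiB BF16 each.
calls_per_token = 97
payload_bytes = 4 * 1024            # 4 KiB per call
link_bytes_s = 200e9 / 8            # 200 Gbps RoCE = 25 GB/s

# Pure wire time for one 4 KiB payload:
wire_us = payload_bytes / link_bytes_s * 1e6
print(f"wire time per call: {wire_us:.3f} us")

# The measured ~18 us per call is ~100x the wire time: almost all of it
# is software and fabric latency, so cutting per-call overhead (not
# adding bandwidth) is what pays off.
total_kib = calls_per_token * payload_bytes / 1024
print(f"payload per token: {total_kib:.0f} KiB")
```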

Why torchrun Instead of Ray?

  • Ray requires its own cluster daemon on each node
  • Ray’s multiprocessing had conflicts with GB10 Unified Memory
  • torchrun + external_launcher is more lightweight — PyTorch brings everything needed
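In practice this means launching the same script once per node with torchrun. A launch-configuration sketch (IP addresses, the port, and the serve_torchrun.py flags are illustrative assumptions, not taken from the repo):

```shell
# Illustrative only; IPs, port, and script flags are assumptions.
# Run on the DGX Spark (rank 0); on the ThinkStation use --node_rank=1.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
         --master_addr=192.168.100.1 --master_port=29500 \
         serve_torchrun.py --tensor-parallel-size 2
```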

Continuous Batching Protocol

# Rank 0 (HTTP + Engine)
while True:
    requests = drain_http_queue()
    broadcast(requests)           # GLOO → all ranks
    engine.step()                 # NCCL sync internally

# Rank 1+ (engine only)
while True:
    requests = receive_broadcast() # GLOO ← Rank 0
    engine.step()                  # NCCL sync internally

Both ranks call engine.step() in exact lockstep — NCCL AllReduce inside step() synchronizes the GPUs automatically.
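The lockstep loop can be mimicked in a toy single-process form to see why neither rank can drift: progress happens only when both ranks reach step() together. This is a sketch with hypothetical names; a queue stands in for the GLOO broadcast and a thread barrier for the NCCL AllReduce:

```python
import queue
import threading

http_queue = queue.Queue()       # stands in for the FastAPI request queue
control_plane = queue.Queue()    # stands in for the GLOO broadcast
barrier = threading.Barrier(2)   # stands in for the NCCL AllReduce sync
steps_done = {"rank0": 0, "rank1": 0}

def engine_step(rank):
    barrier.wait()               # both ranks must enter step() together
    steps_done[rank] += 1

def rank0(n_steps):
    for _ in range(n_steps):
        batch = []
        while not http_queue.empty():   # drain_http_queue()
            batch.append(http_queue.get())
        control_plane.put(batch)        # broadcast(requests)
        engine_step("rank0")

def rank1(n_steps):
    for _ in range(n_steps):
        control_plane.get()             # receive_broadcast()
        engine_step("rank1")

http_queue.put({"prompt": "hello"})
t0 = threading.Thread(target=rank0, args=(3,))
t1 = threading.Thread(target=rank1, args=(3,))
t0.start(); t1.start(); t0.join(); t1.join()
print(steps_done)   # both ranks executed the same number of steps
```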

Result

TP=1 (1 GPU):   97 tok/s  (limit of a single GB10)
TP=2 (2 GPUs): 108 tok/s  (only +11%, AllReduce overhead eats the gain)

The gain was only +11% instead of the expected ~2×, because 97 AllReduce calls per token eat up the budget. Profiling showed: 4.43ms/token for AllReduce = 48% of token time.
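The budget arithmetic behind that profiling claim, spelled out (all numbers are from the measurements in this post):

```python
# AllReduce budget at TP=2 without the fix.
calls = 97              # AllReduce calls per generated token
in_graph_us = 45.6      # measured per-call latency inside a CUDA Graph
token_ms = 9.28         # measured total time per token

allreduce_ms = calls * in_graph_us / 1000
share = allreduce_ms / token_ms
print(f"{allreduce_ms:.2f} ms AllReduce = {share:.0%} of token time")
```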


Breakthrough 2: UF17 EAGER_ALLREDUCE

Problem

NCCL AllReduce inside CUDA Graphs is 2.5× slower than raw.

AllReduce raw (eager):       18 µs  ← calling NCCL directly
AllReduce in CUDA Graph:     46 µs  ← NCCL has internal graph overhead
                             ━━━━
                             +28 µs overhead × 97 calls = 2.66ms/token (29%!)

CUDA Graphs are great for compute kernels (GEMMs, Attention) — they eliminate CPU launch overhead. But NCCL has internal bookkeeping overhead when running in a graph replay (rigid buffers, no dynamic channel selection).

Solution

A single line of code — register vllm::all_reduce as a “Splitting Op”:

# compilation.py
if os.environ.get("VLLM_UF_EAGER_ALLREDUCE", "0") == "1":
    self.splitting_ops.append("vllm::all_reduce")

vLLM’s Piecewise CUDA Graph architecture cuts the FX graph at splitting ops. GEMMs, Attention, RMSNorm stay in CUDA Graphs (no launch overhead), but AllReduce runs eager in between (18µs instead of 46µs).

Before:  [====== CUDA Graph (GEMMs + AllReduce + Norms) ======]
          AllReduce: 46µs × 97 = 4.43ms

After:   [= Graph =] AllReduce [= Graph =] AllReduce [= Graph =]
          18µs eager   18µs eager
          97 × 18µs + 97 × 5µs piecewise = 2.23ms
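The before/after arithmetic, spelled out (45.6 µs is the measured in-graph latency; ~5 µs per split is the piecewise-overhead estimate used above):

```python
# What the split buys: same 97 calls per token, but at eager latency
# plus a small piecewise-graph launch cost per split.
calls = 97
in_graph_us = 45.6      # measured AllReduce latency inside a CUDA Graph
eager_us = 18.0         # measured eager AllReduce latency
piecewise_us = 5.0      # estimated per-split graph relaunch cost

before_ms = calls * in_graph_us / 1000
after_ms = calls * (eager_us + piecewise_us) / 1000
print(f"before: {before_ms:.2f} ms, after: {after_ms:.2f} ms")
print(f"saved:  {before_ms - after_ms:.2f} ms per token")
```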

What’s Left on the Table

Theoretical optimum (AllReduce 18µs IN graph, no overhead):
  97 × 18µs + 0µs piecewise = 1.75ms

UF17 (AllReduce 18µs eager + piecewise overhead):
  97 × 18µs + 97 × ~5µs     = 2.24ms

Difference: ~0.5ms ≈ 10% of token time (5.1ms at 196 tok/s)

~0.5ms (10%) left on the table due to piecewise splits. This would only be recoverable through a fix in NCCL itself.

Result

TP=1:              97 tok/s  (baseline, 1 GPU)
TP=2 without UF17: 108 tok/s  (+11%, AllReduce overhead eats the gain)
TP=2 with UF17:    196 tok/s  (+102% vs TP=1, near-linear scaling!)

torchrun = enabler (makes TP=2 possible at all), UF17 = optimizer (eliminates the AllReduce overhead). Together: two GB10s are twice as fast as a single one — theoretically optimal TP=2 scaling.

Measured Baseline Data

NCCL 2.29.2, ConnectX-7 RoCE 200 Gbps, TP=2 (DGX + PGX)

AllReduce raw, 4 KiB:      18.2 µs per call
AllReduce in CUDA Graph:   45.6 µs per call (2.5× overhead)
97× total (graph):          4.43 ms = 48% of 9.28 ms token time
Graph overhead:             2.66 ms = 29% of token time

Repo: github.com/flash7777/vllm-marlin-sm12x
Build: vllm-nextgen

10 Likes

Excellent, many thanks! I noticed that you provide recipes for serving a number of other models via torchrun. How do they scale?

it’s in progress. i’ve tested qwen3 (working model); next are glm 4.7 flash and qwen coder next.

This is remarkable. Does it work only for Qwen3-Coder, or is it also compatible with the well-regarded INT4 AutoRound Qwen-Coder-Next?

This is confusing, as TP=2 works with Ray too. Also, vLLM natively supports the Torch Distributed backend, and e.g. @dbsci’s Sparkrun uses it by default, but I don’t think we’ve observed any performance gains with this setup.

Have you tested your AllReduce patch with Ray?

ray is slower than the continuous implementation.

see RESULTS_CONTINUOUS.md

thank you. we will test.

Thanks!

A few questions:

  1. Have you tried without speculative decoding? It’s just another variable in the way.
  2. Have you checked the actual model output? I tried to apply the trick you referenced above (the splitting_ops one) and the model started producing garbage (basically just “!!!”).
  3. Not sure why you have such a significant performance degradation with context. I don’t have the autoround variant of that model, but a regular qwen3-vl-30b doesn’t behave like this:
model                                    test             t/s                peak t/s       ttfr (ms)         est_ppt (ms)      e2e_ttft (ms)
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048           8010.47 ± 366.31                  262.92 ± 11.46    256.19 ± 11.46    263.02 ± 11.48
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32             81.46 ± 1.39       84.12 ± 1.42
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d4096   8621.55 ± 157.12                  481.98 ± 8.76     475.25 ± 8.76     482.15 ± 8.71
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d4096   81.32 ± 0.38       83.97 ± 0.40
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d4096   6285.17 ± 9.76                    332.58 ± 0.51     325.85 ± 0.51     332.65 ± 0.50
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d4096     78.14 ± 0.98       80.69 ± 1.01
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d8192   7703.53 ± 500.47                  1074.87 ± 72.73   1068.13 ± 72.73   1074.95 ± 72.72
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d8192   70.49 ± 1.52       72.78 ± 1.57
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d8192   4974.56 ± 318.91                  420.17 ± 27.22    413.44 ± 27.22    420.24 ± 27.22
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d8192     75.01 ± 2.00       77.46 ± 2.05
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d16384  6655.58 ± 69.53                   2468.69 ± 25.68   2461.96 ± 25.68   2468.78 ± 25.69
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d16384  66.06 ± 4.90       68.21 ± 5.06
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d16384  3826.95 ± 83.89                   542.15 ± 11.91    535.41 ± 11.91    542.24 ± 11.89
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d16384    60.72 ± 1.20       62.69 ± 1.25

llama-benchy (0.3.3)
date: 2026-03-01 17:36:08 | latency mode: api

yes, but not published yet. it’s just the sheer mass of test results.

i tested a lot, mainly qwen3 coder 30b; that’s my reference model. at the moment i’m working on uf19 while the testbench iterates over the other models and tunes the parameters (by recombination). most of the results (those that make sense, at least) are published half-automatically via github repos.

and usually i test with my own bench.py, also published in the repos. why not llama_benchy? it doesn’t show the big picture and doesn’t show any accuracy degradation, everything discussed in the last weeks. if you can’t see it, you can’t change it.

it’s important to mention that these patches also depend on the architecture of the model. as an example: when using ep or tp, each layer has its own consolidation over the network; the more layers, the more transfers. each transfer costs a fraction of the scaling for both. it’s not clear how to declare this in test results to make them comparable. i will extend bench.py with an nccl test, ib or something similar.

everyone who uses a switch to connect the sparks will also see more latency. it’s a very small increase in latency per transfer, but if it’s part of N ops per token, it may count for much more than you think.

and yes, i’ve stopped testing new vllm versions whenever they land one more fix while leaving a non-working release on HEAD. it’s just time i don’t have. it would be another iteration loop, combining every finding with every revision of vllm. therefore it’s still the revision i pinned once…

The current upstream covers only 2 of the UF-Mods due to stability issues.
build-ng and start scripts have been updated.

#   Feature               Patch/File                                   Effect
1   Marlin SM12x          marlin_sm12x.patch                           W4A16 + W4A8-FP8 on SM120/SM121
2   CUTLASS 4.3.5         /opt/cutlass (git clone)                     SM120a/SM121a GEMM support
3   Patch 0               inline in Dockerfile                         assume_32bit_indexing (PyTorch 2.10 bug)
4   Patches 3,4,7,8,11    patch_transformers.py                        transformers 5.0 + compressed-tensors compat
5   Streaming Fix         patch_streaming.py                           Anthropic API tool_calls fix
6   MTP+NVFP4             patch_mtp_nvfp4_exclusion.py                 MTP speculative decoding with NVFP4
7   UF17 EAGER_ALLREDUCE  patch_uf17_eager_allreduce.py                NCCL AllReduce outside CUDA Graphs (+8%)
8   UF19v4 RDMA           uf19_rdma.cu + patch_uf19_rdma_allreduce.py  CUDA-Graph-compatible ibverbs AllReduce (−26% vs NCCL)
9   MoE Configs           moe-configs/*.json                           Tuned for GB10 + RTX PRO 6000
10  serve_torchrun.py     /opt/vllm/serve_torchrun.py                  Multi-node TP serving

Reference: Qwen3 Coder 30B achieves 120 tok/s at TP=2 (zero context, long generation). GLM 4.7 Flash reaches only 80 tok/s (EAGLE3 exists, but it doesn’t support MoE). Qwen3 Coder Next didn’t start at TP=2, because Marlin needs an output part size >= 32.

And what’s also in focus now: the number of layers in the model; the fewer, the faster.

It’s getting more complex. Uiuiui.

3 Likes

I’ve also updated the NVIDIA base image, vLLM to 0.16rc2, and transformers to 5.2.0dev0:

vllm-ng16 Benchmark Results (TP=2, UF17)

Model               Quant   Spec           long tok/s      Math
Qwen3-Coder-30B     INT4    EAGLE3 NST=1   116.9           76%
GLM-4.7-Flash       INT4    vanilla        73.3            100%
Qwen3-Coder-Next    INT4    vanilla        FAILED (TP=2)
GLM-4.7-Flash       INT4    EAGLE3         106             100%
Qwen3-Coder-Next    INT4    EAGLE3         87
StepFun 3.5 Flash   INT4    vanilla        57

Comparison vllm-ng (0.15) vs vllm-ng16 (0.16)

  • Qwen3-Coder: 118.8 → 116.9 tok/s (−1.6%, within measurement noise)
  • GLM-4.7-Flash: 72.4 → 73.3 tok/s (+1.2%)
  • No performance regression with vLLM 0.16 + PyTorch 2.11
  • Bonus: assume_32bit_indexing patch no longer needed (native in PyTorch 2.11)
  • Qwen3-Next fails on Marlin min_thread_n=64 at TP=2 (same error as vllm-ng)

UPDATE: there is a BF16-weights fix for INT4 AutoRound that pushes GLM 4.7 Flash at TP=2 up to 105 tok/s.

UPDATE: Qwen3 Next works now, at TP=2 up to 87 tok/s. EAGLE3 works now too, but the existing draft models were optimized for FP8/SGLang.

UPDATE: StepFun 3.5 at TP=2, INT4 AutoRound, vanilla: 57 tok/s* (an A11B model).

1 Like

After switching to Ray for TP=2, I merged eugr’s config with the ng17 baseline established here into a combined ng17e, benchmarked against Qwen 3.5 397B INT4 AutoRound — with the conclusion that nothing currently runs better than this for Qwen on TP=2 with vLLM:

Metric              eugr (without UF17)   ng17e (with UF17)   Delta
Short (20t)         24.2                  24.4                +1%
Medium (150t)       27.9                  28.2                +1%
Long (400t)         24.8                  24.8                0%
Peak (ctx=0 long)   28.6                  28.3                -1%
Math                96%                   96%                 =
ctx=16K long        27.7                  27.2                -2%

Result: Identical. UF17 provides no measurable speedup for this model/setup. The AllReduce overhead is too small relative to total compute time for 397B MoE (only 17B active params, minimal NCCL traffic per token).

The +82% UF17 effect was observed on Qwen3-Coder-30B (dense, high NCCL traffic per token). At 397B MoE with only 2 KV heads and sparse routing, NCCL is simply not the bottleneck.
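A rough sanity check on that claim (the call count is an assumed order of magnitude; 18 µs is the eager AllReduce latency measured earlier in the thread):

```python
# Why UF17 barely moves a 397B MoE at ~28 tok/s: even ~100 eager
# AllReduce calls per token are a tiny slice of a much longer token.
token_ms = 1000 / 28        # ~35.7 ms per token at 28 tok/s
calls = 100                 # assumed order of magnitude per token
eager_us = 18.0             # measured eager AllReduce latency

allreduce_ms = calls * eager_us / 1000
print(f"AllReduce: {allreduce_ms:.1f} ms = "
      f"{allreduce_ms / token_ms:.0%} of token time")
```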

Both images deliver 28 tok/s, 96% Math. ng17e has the advantage of additional patches (Marlin SM12x, MTP, Streaming) for other models. For 397B specifically, eugr’s image is sufficient.

UPDATE:

  • no MTP (autoround)
  • no Drafters available at the moment
1 Like