Why 200 tok/s Is the New Normal — TP=2 Does Scale After All

The story so far … a hardware-limited little Spark: compute-bound, network-bound, memory-bound. Pitched by NVIDIA themselves as “better buy two.” Lots of enthusiasts in the forums. The dead horse “NVFP4” praised as the big hit. Two nodes don’t make anything faster, they’re only good for running larger models? A fresh $150M for the much-praised LLM framework vLLM… And somehow, nothing worked as it was supposed to.

Time to dig in ourselves? Deep into the rabbit hole? That deep?

I found everything needed to push the token rate to just 100 tok/s (zero context, long token generation, single DGX). Nearly all of it was already implemented, but nobody had put it together. Arch-family patches here, wrong environment variables there. Sounds absurd. Was absurd. For a product that was marketed so well.

Today the next narrative is flipped: TP=2 does scale after all. We reach 200 tok/s through better orchestration of what is already there. It shows that the many contributors already gave their best, but nobody picked it up. Not the device manufacturer. Not the well-funded frameworks. But how?

You may want to read [https://forums.developer.nvidia.com/t/why-273-gb-s-less-is-more-until-it-isn-t/] first.

Breakthrough 1: Multi-Node TP via torchrun

Problem

GB10 has 1 GPU per node and 273 GB/s of memory bandwidth. With Qwen3-Coder INT4, that’s enough for ~97 tok/s (TP=1). A single GPU can’t go higher.
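As a sanity check, the ceiling follows from memory bandwidth alone: each generated token streams the active weights from memory once, so tok/s ≈ bandwidth / bytes read per token. A back-of-envelope sketch (the ~2.8 GB/token figure is implied by the measured 97 tok/s, not stated anywhere in this post):

```python
# Back-of-envelope: a memory-bandwidth-bound decoder reads the active
# weights once per token, so tok/s ~= bandwidth / bytes_per_token.
bandwidth_gb_s = 273.0          # GB10 memory bandwidth (GB/s)
tok_s_measured = 97.0           # observed TP=1 rate

# Effective bytes streamed per token, implied by the measurement:
gb_per_token = bandwidth_gb_s / tok_s_measured
print(f"~{gb_per_token:.2f} GB read per token")

# TP=2 shards the weights, halving the per-GPU read per token,
# which is why ~2x tok/s is the theoretical target.
tok_s_tp2_ideal = 2 * tok_s_measured
print(f"ideal TP=2: {tok_s_tp2_ideal:.0f} tok/s")
```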

Solution

Connect two GB10s (DGX Spark + PGX ThinkStation) via QSFP56 RoCE at 200 Gbps, and serve with serve_torchrun.py — a custom server that uses PyTorch’s torchrun launcher instead of Ray.

Architecture

DGX Spark (Rank 0)                    PGX ThinkStation (Rank 1)
┌─────────────────────┐                ┌─────────────────────┐
│  HTTP API (:8011)   │                │                     │
│  ↓ Requests         │                │                     │
│  FastAPI Thread     │                │                     │
│  ↓                  │   GLOO (CPU)   │                     │
│  broadcast(request) │ ──────────────►│  receive(request)   │
│  ↓                  │                │  ↓                  │
│  engine.step()      │   NCCL (GPU)   │  engine.step()      │
│  half GEMMs ───────►│ ◄─AllReduce──► │◄── half GEMMs      │
│  ↓                  │   RoCE 200Gbps │  ↓                  │
│  Response → Client  │                │  (discarded)        │
└─────────────────────┘                └─────────────────────┘

Two Communication Layers

Layer           Protocol           Purpose                  Data
Control Plane   GLOO (CPU, TCP)    Request distribution     JSON (~1 KB)
Data Plane      NCCL (GPU, RoCE)   AllReduce partial sums   4 KiB BF16 × 97/token
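One detail worth making explicit: at 4 KiB per call, the data plane is latency-bound, not bandwidth-bound. A quick sketch (link speed and payload size are from the table above; the wire-time arithmetic is mine):

```python
# Per-token NCCL traffic: 97 AllReduce calls x 4 KiB BF16 each.
calls_per_token = 97
payload_bytes = 4 * 1024            # 4 KiB per call
link_bytes_s = 200e9 / 8            # 200 Gbps RoCE = 25 GB/s

# Pure wire time for one 4 KiB payload:
wire_us = payload_bytes / link_bytes_s * 1e6
print(f"wire time per call: {wire_us:.3f} us")

# The measured ~18 us per call is ~100x the wire time: almost all of it
# is software and fabric latency, so cutting per-call overhead (not
# adding bandwidth) is what pays off.
total_kib = calls_per_token * payload_bytes / 1024
print(f"payload per token: {total_kib:.0f} KiB")
```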

Why torchrun Instead of Ray?

  • Ray requires its own cluster daemon on each node
  • Ray’s multiprocessing had conflicts with GB10 Unified Memory
  • torchrun + external_launcher is more lightweight — PyTorch brings everything needed
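In practice this means launching the same script once per node with torchrun. A launch-configuration sketch (IP addresses, the port, and the serve_torchrun.py flags are illustrative assumptions, not taken from the repo):

```shell
# Illustrative only; IPs, port, and script flags are assumptions.
# Run on the DGX Spark (rank 0); on the ThinkStation use --node_rank=1.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
         --master_addr=192.168.100.1 --master_port=29500 \
         serve_torchrun.py --tensor-parallel-size 2
```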

Continuous Batching Protocol

# Rank 0 (HTTP + Engine)
while True:
    requests = drain_http_queue()
    broadcast(requests)           # GLOO → all ranks
    engine.step()                 # NCCL sync internally

# Rank 1+ (engine only)
while True:
    requests = receive_broadcast() # GLOO ← Rank 0
    engine.step()                  # NCCL sync internally

Both ranks call engine.step() in exact lockstep — NCCL AllReduce inside step() synchronizes the GPUs automatically.
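The lockstep loop can be mimicked in a toy single-process form to see why neither rank can drift: progress happens only when both ranks reach step() together. This is a sketch with hypothetical names; a queue stands in for the GLOO broadcast and a thread barrier for the NCCL AllReduce:

```python
import queue
import threading

http_queue = queue.Queue()       # stands in for the FastAPI request queue
control_plane = queue.Queue()    # stands in for the GLOO broadcast
barrier = threading.Barrier(2)   # stands in for the NCCL AllReduce sync
steps_done = {"rank0": 0, "rank1": 0}

def engine_step(rank):
    barrier.wait()               # both ranks must enter step() together
    steps_done[rank] += 1

def rank0(n_steps):
    for _ in range(n_steps):
        batch = []
        while not http_queue.empty():   # drain_http_queue()
            batch.append(http_queue.get())
        control_plane.put(batch)        # broadcast(requests)
        engine_step("rank0")

def rank1(n_steps):
    for _ in range(n_steps):
        control_plane.get()             # receive_broadcast()
        engine_step("rank1")

http_queue.put({"prompt": "hello"})
t0 = threading.Thread(target=rank0, args=(3,))
t1 = threading.Thread(target=rank1, args=(3,))
t0.start(); t1.start(); t0.join(); t1.join()
print(steps_done)   # both ranks executed the same number of steps
```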

Result

TP=1 (1 GPU):   97 tok/s  (limit of a single GB10)
TP=2 (2 GPUs): 108 tok/s  (only +11%, AllReduce overhead eats the gain)

The gain was only +11% instead of the expected ~2×, because 97 AllReduce calls per token eat up the budget. Profiling showed: 4.43ms/token for AllReduce = 48% of token time.
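The budget arithmetic behind that profiling claim, spelled out (all numbers are from the measurements in this post):

```python
# AllReduce budget at TP=2 without the fix.
calls = 97              # AllReduce calls per generated token
in_graph_us = 45.6      # measured per-call latency inside a CUDA Graph
token_ms = 9.28         # measured total time per token

allreduce_ms = calls * in_graph_us / 1000
share = allreduce_ms / token_ms
print(f"{allreduce_ms:.2f} ms AllReduce = {share:.0%} of token time")
```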


Breakthrough 2: UF17 EAGER_ALLREDUCE

Problem

NCCL AllReduce inside CUDA Graphs is 2.5× slower than raw.

AllReduce raw (eager):       18 µs  ← calling NCCL directly
AllReduce in CUDA Graph:     46 µs  ← NCCL has internal graph overhead
                             ━━━━
                             +28 µs overhead × 97 calls = 2.66ms/token (29%!)

CUDA Graphs are great for compute kernels (GEMMs, Attention) — they eliminate CPU launch overhead. But NCCL has internal bookkeeping overhead when running in a graph replay (rigid buffers, no dynamic channel selection).

Solution

A single line of code — register vllm::all_reduce as a “Splitting Op”:

# compilation.py
if os.environ.get("VLLM_UF_EAGER_ALLREDUCE", "0") == "1":
    self.splitting_ops.append("vllm::all_reduce")

vLLM’s Piecewise CUDA Graph architecture cuts the FX graph at splitting ops. GEMMs, Attention, RMSNorm stay in CUDA Graphs (no launch overhead), but AllReduce runs eager in between (18µs instead of 46µs).

Before:  [====== CUDA Graph (GEMMs + AllReduce + Norms) ======]
          AllReduce: 46µs × 97 = 4.43ms

After:   [= Graph =] AllReduce [= Graph =] AllReduce [= Graph =]
          18µs eager   18µs eager
          97 × 18µs + 97 × 5µs piecewise = 2.23ms
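The before/after arithmetic, spelled out (45.6 µs is the measured in-graph latency; ~5 µs per split is the piecewise-overhead estimate used above):

```python
# What the split buys: same 97 calls per token, but at eager latency
# plus a small piecewise-graph launch cost per split.
calls = 97
in_graph_us = 45.6      # measured AllReduce latency inside a CUDA Graph
eager_us = 18.0         # measured eager AllReduce latency
piecewise_us = 5.0      # estimated per-split graph relaunch cost

before_ms = calls * in_graph_us / 1000
after_ms = calls * (eager_us + piecewise_us) / 1000
print(f"before: {before_ms:.2f} ms, after: {after_ms:.2f} ms")
print(f"saved:  {before_ms - after_ms:.2f} ms per token")
```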

What’s Left on the Table

Theoretical optimum (AllReduce 18µs IN graph, no overhead):
  97 × 18µs + 0µs piecewise = 1.75ms

UF17 (AllReduce 18µs eager + piecewise overhead):
  97 × 18µs + 97 × ~5µs     = 2.24ms

Difference: ~0.5ms ≈ 10% of token time (5.1ms at 196 tok/s)

~0.5ms (10%) left on the table due to piecewise splits. This would only be recoverable through a fix in NCCL itself.

Result

TP=1:              97 tok/s  (baseline, 1 GPU)
TP=2 without UF17: 108 tok/s  (+11%, AllReduce overhead eats the gain)
TP=2 with UF17:    196 tok/s  (+102% vs TP=1, near-linear scaling!)

torchrun = enabler (makes TP=2 possible at all), UF17 = optimizer (eliminates the AllReduce overhead). Together: two GB10s are twice as fast as a single one — theoretically optimal TP=2 scaling.

Measured Baseline Data

NCCL 2.29.2, ConnectX-7 RoCE 200 Gbps, TP=2 (DGX + PGX)

AllReduce raw, 4 KiB:      18.2 µs per call
AllReduce in CUDA Graph:   45.6 µs per call (2.5× overhead)
97× total (graph):          4.43 ms = 48% of 9.28 ms token time
Graph overhead:             2.66 ms = 29% of token time

Repo: github.com/flash7777/vllm-marlin-sm12x
Build: vllm-nextgen

10 Likes

Excellent, many thanks! I noticed that you provide recipes for serving a number of other models via torchrun. How do they scale?

it’s in progress. i’ve tested qwen3 (working model); next are glm 4.7 flash and qwen coder next.

This is remarkable. Does it work only for Qwen3-Coder, or is it also compatible with the well-regarded INT4 AutoRound Qwen-Coder-Next?

This is confusing, as TP=2 works with Ray too. Also, vLLM natively supports the Torch Distributed backend, and e.g. @dbsci’s Sparkrun uses it by default, but I don’t think we’ve observed any performance gains with this setup.

Have you tested your AllReduce patch with Ray?

ray is slower than the continuous implementation.

see RESULTS_CONTINUOUS.md

thank you. we will test.

Thanks!

A few questions:

  1. Have you tried without speculative decoding? It’s just another variable in the way.
  2. Have you checked the actual model output? I tried to apply the trick you referenced above (the splitting_ops one) and the model started producing garbage (basically just “!!!”).
  3. Not sure why you have such a significant performance degradation with context. I don’t have the autoround variant of that model, but a regular qwen3-vl-30b doesn’t behave like this:
model                                    test             t/s                peak t/s       ttfr (ms)         est_ppt (ms)      e2e_ttft (ms)
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048           8010.47 ± 366.31                  262.92 ± 11.46    256.19 ± 11.46    263.02 ± 11.48
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32             81.46 ± 1.39       84.12 ± 1.42
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d4096   8621.55 ± 157.12                  481.98 ± 8.76     475.25 ± 8.76     482.15 ± 8.71
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d4096   81.32 ± 0.38       83.97 ± 0.40
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d4096   6285.17 ± 9.76                    332.58 ± 0.51     325.85 ± 0.51     332.65 ± 0.50
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d4096     78.14 ± 0.98       80.69 ± 1.01
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d8192   7703.53 ± 500.47                  1074.87 ± 72.73   1068.13 ± 72.73   1074.95 ± 72.72
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d8192   70.49 ± 1.52       72.78 ± 1.57
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d8192   4974.56 ± 318.91                  420.17 ± 27.22    413.44 ± 27.22    420.24 ± 27.22
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d8192     75.01 ± 2.00       77.46 ± 2.05
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_pp @ d16384  6655.58 ± 69.53                   2468.69 ± 25.68   2461.96 ± 25.68   2468.78 ± 25.69
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  ctx_tg @ d16384  66.06 ± 4.90       68.21 ± 5.06
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  pp2048 @ d16384  3826.95 ± 83.89                   542.15 ± 11.91    535.41 ± 11.91    542.24 ± 11.89
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ  tg32 @ d16384    60.72 ± 1.20       62.69 ± 1.25

llama-benchy (0.3.3)
date: 2026-03-01 17:36:08 | latency mode: api

yes, but not published yet. it’s just the sheer mass of test results.

i tested a lot, mainly qwen3 coder 30b; that’s my reference model. at the moment i’m working on uf19 while the testbench iterates over the other models and tunes the parameters (by recombination). most of the results (those that make sense, at least) are published half-automatically via github repos.

and usually i test with my own bench.py, also published in the repos. why not llama_benchy? it doesn’t show the big picture and doesn’t show any accuracy degradation, everything discussed in the last weeks. if you can’t see it, you can’t change it.

it’s important to mention that these patches also depend on the architecture of the model. as an example: when using ep or tp, each layer has its own consolidation over the network; the more layers, the more transfers. each transfer costs a fraction of the scaling for both. it’s not clear how to declare this in test results to make them comparable. i will extend bench.py with an nccl test, ib or something similar.

everyone who uses a switch to connect the sparks will also see more latency. it’s a very small increase in latency per transfer, but if it’s part of N ops per token, it may count for much more than you think.

and yes, i’ve stopped testing new vllm versions whenever they land one more fix while leaving a non-working release on HEAD. it’s just time i don’t have. it would be another iteration loop, combining every finding with every revision of vllm. therefore it’s still the revision i pinned once…

The current upstream covers only 2 of the UF-Mods due to stability issues.
build-ng and start scripts have been updated.

#   Feature               Patch/File                                   Effect
1   Marlin SM12x          marlin_sm12x.patch                           W4A16 + W4A8-FP8 on SM120/SM121
2   CUTLASS 4.3.5         /opt/cutlass (git clone)                     SM120a/SM121a GEMM support
3   Patch 0               inline in Dockerfile                         assume_32bit_indexing (PyTorch 2.10 bug)
4   Patches 3,4,7,8,11    patch_transformers.py                        transformers 5.0 + compressed-tensors compat
5   Streaming Fix         patch_streaming.py                           Anthropic API tool_calls fix
6   MTP+NVFP4             patch_mtp_nvfp4_exclusion.py                 MTP speculative decoding with NVFP4
7   UF17 EAGER_ALLREDUCE  patch_uf17_eager_allreduce.py                NCCL AllReduce outside CUDA Graphs (+8%)
8   UF19v4 RDMA           uf19_rdma.cu + patch_uf19_rdma_allreduce.py  CUDA-Graph-compatible ibverbs AllReduce (−26% vs NCCL)
9   MoE Configs           moe-configs/*.json                           Tuned for GB10 + RTX PRO 6000
10  serve_torchrun.py     /opt/vllm/serve_torchrun.py                  Multi-node TP serving

Reference: Qwen3 Coder 30B achieves 120 tok/s at TP=2 (zero context, long generation). GLM 4.7 Flash reaches only 80 tok/s (EAGLE3 exists, but it doesn’t support MoE). Qwen3 Coder Next didn’t start at TP=2, because Marlin needs an output part size >= 32.

And what’s also in focus now: the number of layers in the model; the fewer, the faster.

It’s getting more complex. Uiuiui.

3 Likes

I’ve also updated the NVIDIA base image, vLLM to 0.16rc2, and transformers to 5.2.0dev0:

vllm-ng16 Benchmark Results (TP=2, UF17)

Model               Quant   Spec           long tok/s      Math
Qwen3-Coder-30B     INT4    EAGLE3 NST=1   116.9           76%
GLM-4.7-Flash       INT4    vanilla        73.3            100%
Qwen3-Coder-Next    INT4    vanilla        FAILED (TP=2)
GLM-4.7-Flash       INT4    EAGLE3         106             100%
Qwen3-Coder-Next    INT4    EAGLE3         87
StepFun 3.5 Flash   INT4    vanilla        57

Comparison vllm-ng (0.15) vs vllm-ng16 (0.16)

  • Qwen3-Coder: 118.8 → 116.9 tok/s (−1.6%, within measurement noise)
  • GLM-4.7-Flash: 72.4 → 73.3 tok/s (+1.2%)
  • No performance regression with vLLM 0.16 + PyTorch 2.11
  • Bonus: assume_32bit_indexing patch no longer needed (native in PyTorch 2.11)
  • Qwen3-Next fails on Marlin min_thread_n=64 at TP=2 (same error as vllm-ng)

UPDATE: there is a BF16-weights fix for INT4 AutoRound that pushes GLM 4.7 Flash at TP=2 up to 105 tok/s.

UPDATE: Qwen3 Next works now, at TP=2 up to 87 tok/s. EAGLE3 works now too, but the existing draft models were optimized for FP8/SGLang.

UPDATE: StepFun 3.5 at TP=2, INT4 AutoRound, vanilla: 57 tok/s* (an A11B model).

1 Like

After switching to Ray for TP=2, I merged eugr’s config with the ng17 baseline established here into a combined ng17e, benchmarked against Qwen 3.5 397B INT4 AutoRound — with the conclusion that nothing currently runs better than this for Qwen on TP=2 with vLLM:

Metric              eugr (without UF17)   ng17e (with UF17)   Delta
Short (20t)         24.2                  24.4                +1%
Medium (150t)       27.9                  28.2                +1%
Long (400t)         24.8                  24.8                0%
Peak (ctx=0 long)   28.6                  28.3                -1%
Math                96%                   96%                 =
ctx=16K long        27.7                  27.2                -2%

Result: Identical. UF17 provides no measurable speedup for this model/setup. The AllReduce overhead is too small relative to total compute time for 397B MoE (only 17B active params, minimal NCCL traffic per token).

The +82% UF17 effect was observed on Qwen3-Coder-30B (dense, high NCCL traffic per token). At 397B MoE with only 2 KV heads and sparse routing, NCCL is simply not the bottleneck.
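A rough sanity check on that claim (the call count is an assumed order of magnitude; 18 µs is the eager AllReduce latency measured earlier in the thread):

```python
# Why UF17 barely moves a 397B MoE at ~28 tok/s: even ~100 eager
# AllReduce calls per token are a tiny slice of a much longer token.
token_ms = 1000 / 28        # ~35.7 ms per token at 28 tok/s
calls = 100                 # assumed order of magnitude per token
eager_us = 18.0             # measured eager AllReduce latency

allreduce_ms = calls * eager_us / 1000
print(f"AllReduce: {allreduce_ms:.1f} ms = "
      f"{allreduce_ms / token_ms:.0%} of token time")
```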

Both images deliver 28 tok/s, 96% Math. ng17e has the advantage of additional patches (Marlin SM12x, MTP, Streaming) for other models. For 397B specifically, eugr’s image is sufficient.

UPDATE:

  • no MTP (autoround)
  • no Drafters available at the moment
1 Like