The story so far … a hardware-limited little Spark — compute-bound, network-bound, memory-bound. Already pitched by NVIDIA themselves as “better buy two.” Lots of enthusiasts in the forums. The dead horse “NVFP4” praised as the big hit. Two nodes don’t make anything faster — they’re only good for running larger models? A fresh $150M for the much-praised LLM framework vLLM… And somehow, nothing worked the way it was supposed to.
Time to dig in ourselves, then. How deep into the rabbit hole? That deep.
I found everything needed to push the token rate to just 100 tok/s (zero context, long generation, single DGX). Nearly all of it was already implemented — but nobody had put it together. An arch-family patch here, a wrong environment variable there. Sounds absurd. Was absurd. For a product that was marketed this well.
Today the next narrative gets flipped: TP=2 does scale after all. We reach 200 tok/s through better orchestration of what’s already there. It shows that the many contributors already gave their best — but nobody picked it up. Not the device manufacturer. Not the well-funded frameworks. How?
You may have read [https://forums.developer.nvidia.com/t/why-273-gb-s-less-is-more-until-it-isn-t/] before.
Problem
GB10 has 1 GPU per node and 273 GB/s bandwidth. With Qwen3-Coder INT4, that’s enough for ~97 tok/s (TP=1). Can’t go higher with a single GPU.
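A quick back-of-the-envelope check (my arithmetic, not from the post): decode is bandwidth-bound, so the achievable token rate is roughly memory bandwidth divided by the bytes touched per token. The ~2.8 GB/token figure below is implied by the two numbers above, not stated explicitly.

```python
# Roofline-style estimate: bandwidth-bound decode reads (nearly) all
# active weights once per token, so tok/s ≈ bandwidth / bytes_per_token.
MEM_BW_GB_S = 273.0   # GB10 memory bandwidth (from the post)
TOK_S = 97.0          # measured TP=1 decode rate

bytes_per_token_gb = MEM_BW_GB_S / TOK_S
print(f"~{bytes_per_token_gb:.2f} GB touched per token")  # → ~2.81 GB touched per token
```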
Solution
Connect two GB10s (DGX Spark + PGX ThinkStation) via QSFP56 RoCE at 200 Gbps, driven by serve_torchrun.py — a custom server that uses PyTorch’s torchrun launcher instead of Ray.
Architecture
DGX Spark (Rank 0)                      PGX ThinkStation (Rank 1)
┌─────────────────────┐                 ┌─────────────────────┐
│ HTTP API (:8011)    │                 │                     │
│   ↓ Requests        │                 │                     │
│ FastAPI Thread      │                 │                     │
│   ↓                 │   GLOO (CPU)    │                     │
│ broadcast(request)  │ ───────────────►│ receive(request)    │
│   ↓                 │                 │   ↓                 │
│ engine.step()       │   NCCL (GPU)    │ engine.step()       │
│ half GEMMs ────────►│ ◄──AllReduce──► │◄── half GEMMs       │
│   ↓                 │   RoCE 200Gbps  │   ↓                 │
│ Response → Client   │                 │ (discarded)         │
└─────────────────────┘                 └─────────────────────┘
Two Communication Layers
| Layer | Protocol | Purpose | Data |
|---|---|---|---|
| Control Plane | GLOO (CPU, TCP) | Request distribution | JSON (~1 KB) |
| Data Plane | NCCL (GPU, RoCE) | AllReduce partial sums | 4 KiB BF16 × 97/token |
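A minimal sketch of how the two planes could be wired up with torch.distributed — the function names and group setup here are mine, not taken from serve_torchrun.py:

```python
import torch.distributed as dist

def init_planes():
    """Create both communication layers. Rank and world size come from
    torchrun's environment variables (RANK, WORLD_SIZE, MASTER_ADDR, ...)."""
    dist.init_process_group(backend="gloo")   # control plane: CPU over TCP
    return dist.new_group(backend="nccl")     # data plane: GPU over RoCE

def distribute_requests(requests, rank):
    """Control plane step: rank 0 broadcasts the ~1 KB request batch over
    GLOO; every other rank receives it into a placeholder slot."""
    box = [requests] if rank == 0 else [None]
    dist.broadcast_object_list(box, src=0)
    return box[0]
```

The NCCL group is returned separately so the AllReduce traffic inside `engine.step()` never touches the TCP control path.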
Why torchrun Instead of Ray?
- Ray requires its own cluster daemon on each node
- Ray’s multiprocessing had conflicts with GB10 Unified Memory
- torchrun + external_launcher is more lightweight — PyTorch brings everything needed
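The launch then reduces to plain torchrun — no cluster daemon, no head node. A hypothetical invocation (the flags are standard torchrun; the script name comes from this post, the address is a placeholder):

```shell
# Rank 0 on the DGX Spark; run the same command with --node-rank=1 on the PGX
torchrun --nnodes=2 --nproc-per-node=1 --node-rank=0 \
         --master-addr=<spark-ip> --master-port=29500 \
         serve_torchrun.py
```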
Continuous Batching Protocol
# Rank 0 (HTTP + Engine)
while True:
    requests = drain_http_queue()
    broadcast(requests)              # GLOO → all ranks
    engine.step()                    # NCCL sync internally

# Rank 1+ (engine only)
while True:
    requests = receive_broadcast()   # GLOO ← Rank 0
    engine.step()                    # NCCL sync internally
Both ranks call engine.step() in exact lockstep — NCCL AllReduce inside step() synchronizes the GPUs automatically.
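For the lockstep to hold, drain_http_queue() must never block: rank 0 has to keep broadcasting (even an empty batch) so both ranks call engine.step() the same number of times. A sketch of such a non-blocking drain — the helper name is taken from the listing above, the implementation is mine:

```python
import queue

def drain_http_queue(q: queue.Queue, max_batch: int = 8) -> list:
    """Collect whatever the FastAPI thread enqueued since the last step,
    without waiting. An empty list still gets broadcast, so the ranks
    never drift out of step (which would deadlock NCCL)."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())   # non-blocking pop
        except queue.Empty:
            break
    return batch
```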
Result
TP=1 (1 GPU): 97 tok/s (limit of a single GB10)
TP=2 (2 GPUs): 108 tok/s (only +11%, AllReduce overhead eats the gain)
The gain was only +11% instead of the expected ~2×, because 97 AllReduce calls per token eat up the budget. Profiling showed: 4.43ms/token for AllReduce = 48% of token time.
Breakthrough 2: UF17 EAGER_ALLREDUCE
Problem
NCCL AllReduce inside CUDA Graphs is 2.5× slower than raw.
AllReduce raw (eager):   18.2 µs ← calling NCCL directly
AllReduce in CUDA Graph: 45.6 µs ← NCCL has internal graph overhead
━━━━
+27.4 µs overhead × 97 calls ≈ 2.66 ms/token (29%!)
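The per-token overhead budget, spelled out with the measured constants from this post:

```python
RAW_US = 18.2          # eager NCCL AllReduce, 4 KiB BF16 (measured)
GRAPH_US = 45.6        # same call replayed inside a CUDA Graph (measured)
CALLS_PER_TOKEN = 97   # AllReduce calls per generated token
TOKEN_MS = 9.28        # token time at 108 tok/s

overhead_ms = (GRAPH_US - RAW_US) * CALLS_PER_TOKEN / 1000
share = overhead_ms / TOKEN_MS
print(f"{overhead_ms:.2f} ms graph overhead = {share:.0%} of token time")
# → 2.66 ms graph overhead = 29% of token time
```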
CUDA Graphs are great for compute kernels (GEMMs, Attention) — they eliminate CPU launch overhead. But NCCL has internal bookkeeping overhead when running in a graph replay (rigid buffers, no dynamic channel selection).
Solution
A single line of code — register vllm::all_reduce as a “Splitting Op”:
# compilation.py
if os.environ.get("VLLM_UF_EAGER_ALLREDUCE", "0") == "1":
    self.splitting_ops.append("vllm::all_reduce")
vLLM’s Piecewise CUDA Graph architecture cuts the FX graph at splitting ops. GEMMs, Attention, RMSNorm stay in CUDA Graphs (no launch overhead), but AllReduce runs eager in between (18µs instead of 46µs).
Before: [====== CUDA Graph (GEMMs + AllReduce + Norms) ======]
        AllReduce: 45.6 µs × 97 = 4.43 ms

After:  [= Graph =] AllReduce [= Graph =] AllReduce [= Graph =]
                    18 µs eager          18 µs eager
        97 × 18 µs + 97 × ~5 µs piecewise ≈ 2.23 ms
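The same splitting idea in miniature, on a toy torch.fx graph — here torch.relu stands in for vllm::all_reduce, and the partition logic is illustrative, not vLLM's actual pass:

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.split_module import split_module

class Block(torch.nn.Module):
    def forward(self, x):
        y = x + 1           # first graph piece (would stay CUDA-graphed)
        y = torch.relu(y)   # stand-in for the splitting op (AllReduce)
        return y * 2        # second graph piece

mod = Block()
gm = symbolic_trace(mod)

# Assign a partition id per node, bumping it at the splitting op —
# mimicking how vLLM cuts its FX graph at vllm::all_reduce.
parts, part = {}, 0
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        part += 1
    parts[node.name] = part

split = split_module(gm, mod, lambda n: parts.get(n.name, 0))
# `split` now contains two submodules — the pieces that would each get
# their own CUDA Graph, with the splitting op run eagerly in between.
```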
What’s Left on the Table
Theoretical optimum (AllReduce 18 µs in-graph, no overhead):
    97 × 18 µs + 0 µs piecewise = 1.75 ms
UF17 (AllReduce 18 µs eager + piecewise overhead):
    97 × 18 µs + 97 × ~5 µs ≈ 2.23 ms
Difference: ~0.5 ms ≈ 10% of token time (5.1 ms at 196 tok/s)
~0.5ms (10%) left on the table due to piecewise splits. This would only be recoverable through a fix in NCCL itself.
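The same ledger as plain arithmetic (rounded constants from above; the ~5 µs piecewise cost is the estimate, not a direct measurement):

```python
CALLS = 97
EAGER_US = 18.0        # eager AllReduce cost
PIECEWISE_US = 5.0     # per-split replay overhead (~5 µs, estimated above)
TOKEN_MS = 5.1         # token time at 196 tok/s

optimum_ms = CALLS * EAGER_US / 1000                 # ≈1.75 ms: in-graph, no overhead
uf17_ms = CALLS * (EAGER_US + PIECEWISE_US) / 1000   # ≈2.23 ms: eager + piecewise splits
left_ms = uf17_ms - optimum_ms                       # ≈0.49 ms ≈ 10% of the 5.1 ms token
```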
Result
TP=1: 97 tok/s (baseline, 1 GPU)
TP=2 without UF17: 108 tok/s (+11%, AllReduce overhead eats the gain)
TP=2 with UF17: 196 tok/s (+102% vs TP=1, near-linear scaling!)
torchrun = enabler (makes TP=2 possible at all), UF17 = optimizer (eliminates the AllReduce graph overhead). Together: two GB10s are twice as fast as a single one — effectively ideal TP=2 scaling.
Measured Baseline Data
NCCL 2.29.2, ConnectX-7 RoCE 200 Gbps, TP=2 (DGX + PGX)
AllReduce raw, 4 KiB:     18.2 µs/call
AllReduce in CUDA Graph:  45.6 µs/call (2.5× overhead)
97 calls/token (graph):   4.43 ms = 48% of the 9.28 ms token time
Graph overhead alone:     2.66 ms = 29% of token time
Repo: github.com/flash7777/vllm-marlin-sm12x
Build: vllm-nextgen