Hey everyone, looking for a sanity check on some benchmark results that don’t match my expectations.
The setup:
- 2× DGX Spark (Grace Blackwell, 128 GB unified each), CX7 200GbE stacking link
- Model: GPT-OSS-120B, mxfp4 quantization (vLLM 0.1.dev12777)
- Recipe: gpu_memory_utilization=0.70, max_model_len=4096, --enforce-eager
- Benchmark: llama-benchy, pp1024/tg128, 50 requests per concurrency level
- All requests go through LiteLLM proxy in every topology (solo, cluster, 2×Solo)
- All 4 runs same day, same model, same image
What I tested:
- Solo - single node, single vLLM instance
- Cluster TP=2 - both nodes as one vLLM instance via tensor parallelism
- 2×Solo + LiteLLM simple-shuffle - two independent vLLM instances, LiteLLM proxy with random 50/50 routing
- 2×Solo + LiteLLM least-busy - same but with least-busy routing
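For context on the simple-shuffle case: with random 50/50 routing the two nodes will rarely hold exactly the same number of requests. Here is a rough sketch of how uneven the split can get, assuming each in-flight request is routed independently with probability 0.5. That is a simplifying assumption about the router, not something I measured, and the function names are just illustrative.

```python
from math import comb


def split_probabilities(n_requests: int, p: float = 0.5) -> dict[int, float]:
    """Probability that node A holds k of n_requests in-flight requests.

    Assumes every request is routed independently (binomial model).
    """
    return {
        k: comb(n_requests, k) * p**k * (1 - p) ** (n_requests - k)
        for k in range(n_requests + 1)
    }


def expected_busier_load(n_requests: int) -> float:
    """Expected number of requests sitting on the busier of the two nodes."""
    probs = split_probabilities(n_requests)
    return sum(max(k, n_requests - k) * pr for k, pr in probs.items())


if __name__ == "__main__":
    for c in (2, 4, 8, 16, 32, 64):
        probs = split_probabilities(c)
        print(
            f"c{c}: P(exactly even split)={probs[c // 2]:.3f}, "
            f"E[requests on busier node]={expected_busier_load(c):.2f}"
        )
```

Under this model, 8 in-flight requests land in an exact 4/4 split with probability C(8,4)/2^8 = 70/256, roughly 27%, so "about 4 per node" is an average, not the typical state.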
My expectation: 2×Solo behind a proxy should always produce higher total throughput than cluster TP=2. Two independent engines, each handling their own requests, no cross-GPU synchronization overhead. Double the compute, double the throughput.
What actually happened: on decode, cluster wins through c32 and the proxy (simple-shuffle) only wins at c64. On prefill, shuffle already overtakes cluster at c16.
Decode throughput (tg128, tokens/s total):
| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 57.5 | 69.0 | 57.7 | 57.5 |
| c2 | 78.4 | 104.1 | 99.4 | 87.7 |
| c4 | 107.9 | 155.3 | 127.9 | 122.3 |
| c8 | 153.5 | 231.0 | 177.5 | 162.5 |
| c16 | 218.5 | 342.1 | 268.6 | 238.5 |
| c32 | 318.7 | 471.7 | 382.1 | 333.2 |
| c64 | 315.2 | 471.8 | 567.9 | 338.4 |
Prefill throughput (pp1024, tokens/s total):
| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 2926 | 4442 | 2922 | 3074 |
| c2 | 3684 | 5428 | 4669 | 4074 |
| c4 | 4540 | 6343 | 5578 | 5299 |
| c8 | 5905 | 7965 | 7483 | 6579 |
| c16 | 6858 | 8816 | 10072 | 8886 |
| c32 | 6827 | 8966 | 11741 | 9753 |
| c64 | 2753 | 3873 | 12064 | 4982 |
TTFT (ms, lower = better):
| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 369 | 251 | 370 | 364 |
| c2 | 570 | 381 | 459 | 526 |
| c4 | 898 | 640 | 675 | 735 |
| c8 | 1385 | 1020 | 1005 | 1232 |
| c16 | 2384 | 1855 | 1495 | 1690 |
| c32 | 4768 | 3642 | 2551 | 3303 |
| c64 | 10874 | 8032 | 4940 | 7360 |
Where I’m confused:
At c1 only one node is working in the proxy setup, so that row is apples to oranges. Take c8 instead, where both nodes are clearly busy: the 2×Solo+shuffle setup gives 177.5 t/s decode. Each node is handling ~4 concurrent requests and both GPUs are busy, which is the scenario where two independent engines should shine. But cluster TP=2 gives 231.0 t/s, 30% higher total throughput, even though it has cross-GPU synchronization overhead.
This pattern holds all the way through c32 (cluster 471.7 vs shuffle 382.1, +23%). The proxy setup only overtakes cluster at c64 (567.9 vs 471.8).
My mental model was “two independent engines = 2× the request capacity = higher total throughput.” But TP=2 apparently makes each individual request fast enough that even with NCCL coordination costs, total throughput is higher until the synchronization bottleneck kicks in at very high concurrency.
My question: Is this expected? Specifically:
- Should cluster TP=2 over CX7 genuinely produce higher total throughput than 2× independent nodes up to c32?
- Or is something in my 2×Solo setup (vLLM config, LiteLLM overhead, proxy latency) leaving performance on the table?
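One way to frame the second question: if the proxy split load perfectly and added no overhead, 2×Solo at concurrency c should look like two Solo nodes each running c/2, i.e. 2 × Solo(c/2). The sketch below compares that idealized figure against the measured shuffle and cluster decode numbers from the table. The even-split baseline is my own assumption for the comparison, not a measured result.

```python
# Decode throughput (tg128, tokens/s total), copied from the table above.
SOLO = {1: 57.5, 2: 78.4, 4: 107.9, 8: 153.5, 16: 218.5, 32: 318.7, 64: 315.2}
SHUFFLE = {1: 57.7, 2: 99.4, 4: 127.9, 8: 177.5, 16: 268.6, 32: 382.1, 64: 567.9}
CLUSTER = {1: 69.0, 2: 104.1, 4: 155.3, 8: 231.0, 16: 342.1, 32: 471.7, 64: 471.8}


def ideal_two_solo(concurrency: int) -> float:
    """Idealized 2xSolo throughput: two Solo nodes, each at concurrency / 2.

    Assumes a perfectly even split and zero proxy overhead (hypothetical baseline).
    Only defined for concurrency >= 2 present in the table.
    """
    return 2.0 * SOLO[concurrency // 2]


if __name__ == "__main__":
    for c in (2, 4, 8, 16, 32, 64):
        ideal = ideal_two_solo(c)
        print(
            f"c{c}: ideal 2xSolo={ideal:.1f}, "
            f"shuffle={SHUFFLE[c]:.1f} ({SHUFFLE[c] / ideal:.0%} of ideal), "
            f"cluster={CLUSTER[c]:.1f}"
        )
```

At c8 the even-split estimate is 2 × 107.9 = 215.8 t/s. Measured shuffle (177.5) is about 82% of that, which suggests some loss from routing imbalance or proxy overhead, but even the idealized figure is still below the 231.0 t/s measured for cluster TP=2. Under this model, the gap at c8 would not be explained by the proxy alone.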