Why does Cluster TP=2 beat 2× Solo + LiteLLM proxy at low/mid concurrency?

Hey everyone, looking for a sanity check on some benchmark results that don’t match my expectations.

The setup:

  • 2× DGX Spark (Grace Blackwell, 128 GB unified each), CX7 200GbE stacking link

  • Model: GPT-OSS-120B, mxfp4 quantization (vLLM 0.1.dev12777)

  • Recipe: gpu_memory_utilization=0.70, max_model_len=4096, --enforce-eager

  • Benchmark: llama-benchy, pp1024/tg128, 50 requests per concurrency level

  • All requests go through LiteLLM proxy in every topology (solo, cluster, 2×Solo)

  • All 4 runs same day, same model, same image

What I tested:

  1. Solo - single node, single vLLM instance

  2. Cluster TP=2 - both nodes as one vLLM instance via tensor parallelism

  3. 2×Solo + LiteLLM simple-shuffle - two independent vLLM instances, LiteLLM proxy with random 50/50 routing

  4. 2×Solo + LiteLLM least-busy - same but with least-busy routing

My expectation: 2×Solo behind a proxy should always produce higher total throughput than cluster TP=2. Two independent engines, each handling their own requests, no cross-GPU synchronization overhead. Double the compute, double the throughput.

What actually happened - cluster wins up to c16-c32, proxy only wins at c64:

Decode throughput (tg128, tokens/s total):

| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 57.5 | 69.0 | 57.7 | 57.5 |
| c2 | 78.4 | 104.1 | 99.4 | 87.7 |
| c4 | 107.9 | 155.3 | 127.9 | 122.3 |
| c8 | 153.5 | 231.0 | 177.5 | 162.5 |
| c16 | 218.5 | 342.1 | 268.6 | 238.5 |
| c32 | 318.7 | 471.7 | 382.1 | 333.2 |
| c64 | 315.2 | 471.8 | 567.9 | 338.4 |

Prefill throughput (pp1024, tokens/s total):

| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 2926 | 4442 | 2922 | 3074 |
| c2 | 3684 | 5428 | 4669 | 4074 |
| c4 | 4540 | 6343 | 5578 | 5299 |
| c8 | 5905 | 7965 | 7483 | 6579 |
| c16 | 6858 | 8816 | 10072 | 8886 |
| c32 | 6827 | 8966 | 11741 | 9753 |
| c64 | 2753 | 3873 | 12064 | 4982 |

TTFT (ms, lower = better):

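| Concurrency | Solo (1 node) | Cluster TP=2 | 2×Solo shuffle | 2×Solo least-busy |
|---|---|---|---|---|
| c1 | 369 | 251 | 370 | 364 |
| c2 | 570 | 381 | 459 | 526 |
| c4 | 898 | 640 | 675 | 735 |
| c8 | 1385 | 1020 | 1005 | 1232 |
| c16 | 2384 | 1855 | 1495 | 1690 |
| c32 | 4768 | 3642 | 2551 | 3303 |
| c64 | 10874 | 8032 | 4940 | 7360 |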

One caveat on the numbers above: at c1 only one node is working in the proxy setup, so that row is apples to oranges. I'll use c8 instead, where both nodes are clearly busy.


Where I’m confused:

Take c8: the 2×Solo+shuffle setup gives 177.5 t/s decode. Each node is handling ~4 concurrent requests, both GPUs are busy - this is the scenario where having two independent engines should shine. But cluster TP=2 gives 231.0 t/s - 30% faster total throughput, even though it has cross-GPU synchronization overhead.

This pattern holds all the way through c32 (cluster 471.7 vs shuffle 382.1, +23%). The proxy setup only overtakes cluster at c64 (567.9 vs 471.8).
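For anyone who wants to recheck the percentages, here is a minimal Python sketch that recomputes the cluster-vs-shuffle gap per concurrency level. The numbers are copied from the decode table in this post; the dictionary and function names are just illustrative.

```python
# Decode throughput (tokens/s total), copied from the tg128 table in this post.
cluster_tp2 = {1: 69.0, 2: 104.1, 4: 155.3, 8: 231.0, 16: 342.1, 32: 471.7, 64: 471.8}
solo2_shuffle = {1: 57.7, 2: 99.4, 4: 127.9, 8: 177.5, 16: 268.6, 32: 382.1, 64: 567.9}


def cluster_advantage_pct(concurrency):
    """Percent by which cluster TP=2 leads (+) or trails (-) 2xSolo shuffle."""
    ratio = cluster_tp2[concurrency] / solo2_shuffle[concurrency]
    return (ratio - 1.0) * 100.0


# One entry per concurrency level, rounded to one decimal place.
advantage = {c: round(cluster_advantage_pct(c), 1) for c in cluster_tp2}

for c, pct in advantage.items():
    print(f"c{c}: cluster vs shuffle {pct:+.1f}%")
```

The sign flips only at c64, which matches the crossover described above.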

My mental model was “two independent engines = 2× the request capacity = higher total throughput.” But TP=2 apparently makes each individual request fast enough that even with NCCL coordination costs, total throughput is higher until the synchronization bottleneck kicks in at very high concurrency.

My question: Is this expected? Specifically:

  • Should cluster TP=2 over CX7 genuinely produce higher total throughput than 2× independent nodes up to c32?

  • Or is something in my 2×Solo setup (vLLM config, LiteLLM overhead, proxy latency) leaving performance on the table?
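One rough way to probe the second question using only the numbers already posted: if the proxy split traffic perfectly evenly, each node at total concurrency c would behave like the Solo run at c/2, so 2× the Solo figure at c/2 is an idealized target for 2×Solo at c. The sketch below computes that target and compares it with the measured shuffle and cluster results. The even-split assumption and the helper names are mine, not something the benchmark measured.

```python
# Decode throughput (tokens/s total), copied from the tg128 table in this post.
solo = {1: 57.5, 2: 78.4, 4: 107.9, 8: 153.5, 16: 218.5, 32: 318.7, 64: 315.2}
shuffle = {1: 57.7, 2: 99.4, 4: 127.9, 8: 177.5, 16: 268.6, 32: 382.1, 64: 567.9}
cluster = {1: 69.0, 2: 104.1, 4: 155.3, 8: 231.0, 16: 342.1, 32: 471.7, 64: 471.8}


def ideal_two_solo(concurrency):
    """Idealized 2xSolo throughput.

    Assumption: a perfectly even split, so each node looks like the
    Solo run at half the total concurrency.
    """
    return 2.0 * solo[concurrency // 2]


rows = {}
for c in (2, 4, 8, 16, 32, 64):
    ideal = ideal_two_solo(c)
    rows[c] = {
        "ideal": ideal,
        "efficiency": shuffle[c] / ideal,        # measured shuffle / idealized target
        "cluster_vs_ideal": cluster[c] / ideal,  # > 1 means cluster beats the target
    }
    print(f"c{c}: ideal {ideal:.1f}, shuffle at {rows[c]['efficiency']:.0%} of ideal, "
          f"cluster/ideal {rows[c]['cluster_vs_ideal']:.2f}")
```

At c8 the idealized target is 215.8 t/s, against 177.5 measured for shuffle and 231.0 for the cluster. If the even-split assumption is roughly fair, that hints that both things may be going on: the proxy setup is leaving some throughput on the table, and TP=2 still has a real edge at that concurrency.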

I highly suggest everyone stop using LiteLLM immediately

LiteLLM is a fine solution for an LLM gateway. The supply chain attack they were hit with doesn’t imply that they have bad security hygiene. Supply chain attacks can be a challenge to defend against, and technically the root cause of their compromise wasn’t bad code; it was the use of a compromised vulnerability scanner in their GitHub Actions CI/CD. This has more implications for the use of GitHub Actions than it does for LiteLLM. They actually reacted very quickly and gave the community some great transparency, which isn’t common from companies after a compromise these days.

I would amend your statement to: “I suggest everyone stop using LiteLLM versions 1.82.7 and 1.82.8 immediately and upgrade to a known safe version.”

Writing off a project entirely after a supply chain attack isn’t realistic. Eventually every widely used project will experience a security incident, and you’ll have no one left to go with.

Apologies, I didn’t answer the question!

I haven’t tried to replicate your results, so I’m assuming they’re valid, but what you’re seeing is believable. The important detail is that Spark only has 273 GB/s of unified-memory bandwidth, which is on the low side, and that makes decode much more sensitive to memory traffic and batching.

In the TP=2 setup, a single vLLM scheduler sees the full request stream and can build larger continuous batches. In the 2×Solo case, the same traffic is split across two engines, so each backend often runs smaller decode batches. On a bandwidth-constrained system, that can reduce effective memory-system utilization enough that TP=2 behaves as if it has higher usable bandwidth, even though the raw hardware bandwidth hasn’t changed. (And for TP=2, since you’ve sharded the weights, each device only needs to read a portion of the model weights for the tensor-parallel matmuls, which can improve how much “useful work” you get per second even after accounting for communication.)
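To make the weight-sharding point concrete, here is a deliberately crude Python sketch of a purely bandwidth-bound decode step: each step reads the weights once and emits one token per sequence in the batch. Only the 273 GB/s figure comes from this thread. The bytes-per-step and per-step communication cost are made-up placeholders, and the model ignores compute limits, KV-cache reads, and the fact that an MoE model touches more experts as the batch grows, so it will not reproduce the measured numbers.

```python
BANDWIDTH_BYTES_S = 273e9   # Spark unified-memory bandwidth quoted in this thread
BYTES_PER_STEP = 5e9        # HYPOTHETICAL: weight bytes read per decode step
COMM_S = 0.002              # HYPOTHETICAL: per-step TP communication cost (seconds)


def decode_tokens_per_s(batch, shards=1, comm_s=0.0):
    """Toy bandwidth-bound model: one weight read per step, one token per sequence."""
    step_s = BYTES_PER_STEP / shards / BANDWIDTH_BYTES_S + comm_s
    return batch / step_s


def two_solo(total_batch):
    """Two independent engines, each reading the full weights for half the batch."""
    return 2 * decode_tokens_per_s(total_batch / 2)


def cluster_tp2(total_batch, comm_s=COMM_S):
    """One engine over two devices: each reads half the weights for the whole batch."""
    return decode_tokens_per_s(total_batch, shards=2, comm_s=comm_s)


for batch in (2, 8, 32):
    print(f"batch {batch}: 2xSolo {two_solo(batch):.0f} t/s, "
          f"TP=2 {cluster_tp2(batch):.0f} t/s")
```

In this toy, TP=2 with zero communication cost is exactly 2× the 2×Solo total at the same overall batch, because every weight read is amortized over twice as many sequences. Any nonzero communication cost eats into that margin, and once a node becomes compute-bound rather than bandwidth-bound, the advantage should shrink or reverse.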

That’s why TP=2 can still win overall at moderate concurrency: better batching and faster per-request progress outweigh the clustering overhead. So it’s plausible for TP=2 to beat two independent nodes up to something like c32 on decode-heavy workloads with short contexts, with the independent setup only pulling ahead once communication and scheduling overheads dominate.

That’s (one reason) why the Spark has ConnectX-7 and why we make Spark clusters. :-)

Thanks for the heads‑up, my LiteLLM is pinned to v1.82.6.

Hi @dbsci, thank you for the answer. It helped me understand that I need to learn more and go deeper down this rabbit hole :-) I found a few interesting articles that explain parallelism, batching, etc.; maybe they will be useful for someone else too: