Hello,
This guide details the steps to set up a disaggregated serving system where the prefill and decode phases run on separate GPUs, using LMCache (https://lmcache.ai/) for efficient KV cache transfer and vLLM (https://github.com/vllm-project/vllm), a high-throughput and memory-efficient inference and serving engine for LLMs.
Hardware Setup: Dual NVIDIA RTX PRO 6000 Blackwell GPUs on a Gigabyte TRX50 AI TOP motherboard with an AMD Ryzen Threadripper 9960X CPU
Install uv (a fast Python package installer):
curl -LsSf https://astral.sh/uv/install.sh | sh
Create a Virtual Environment:
sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
Install vLLM:
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
Install LMCache:
git clone https://github.com/LMCache/LMCache.git
cd LMCache/
uv pip install -r requirements/build.txt
uv pip install -e . --no-build-isolation
Install NIXL: The NIXL wheel bundles supported backends (like UCX).
uv pip install nixl[cu13]
uv pip install ucxx-cu13
GPU Connectivity Check (P2P)
Before running the split workload, verify peer-to-peer (P2P) access between the GPUs.
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
Expected Output: True
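For boxes with more than two GPUs, a small loop prints the whole peer-access matrix. This is a sketch that degrades gracefully when PyTorch or GPUs are absent:

```python
# Print the peer-access matrix for all visible CUDA GPUs.
def p2p_matrix():
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed
    n = torch.cuda.device_count()
    return [
        [i == j or torch.cuda.can_device_access_peer(i, j) for j in range(n)]
        for i in range(n)
    ]

matrix = p2p_matrix()
if matrix is None:
    print("PyTorch not available")
else:
    for i, row in enumerate(matrix):
        print(i, ["Y" if ok else "N" for ok in row])
```

On this dual-GPU box every off-diagonal entry should be "Y".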
Detailed Bandwidth & Latency Test (CUDA Samples):
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
mkdir build && cd build
cmake .. \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
-DCMAKE_C_COMPILER=/usr/bin/gcc \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-13.0 \
-DCMAKE_CUDA_ARCHITECTURES=120
make -j$(nproc)
Run the test:
cd Samples/5_Domain_Specific/p2pBandwidthLatencyTest
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1490.98 43.76
1 43.55 1525.93
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1488.14 55.46
1 55.86 1514.10
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1483.09 56.01
1 55.97 1503.08
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1483.09 106.14
1 106.64 1494.46
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.07 14.31
1 14.33 2.07
CPU 0 1
0 2.14 5.59
1 5.63 2.09
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.07 0.44
1 0.51 2.07
CPU 0 1
0 2.10 1.44
1 1.43 2.08
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
In this disaggregated setup:
- Prefill Instance (GPU 0): Processes the prompt and generates the initial KV cache. It is optimized for time to first token (TTFT).
- Decode Instance (GPU 1): Receives the KV cache and generates subsequent tokens. It is optimized for time per output token (TPOT).
- LMCache: Handles the transfer of the KV cache between the two instances via the network/interconnect.
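The launch commands below reference two LMCache config files that are not shown here. A minimal sketch of the prefiller side, with key names following the NIXL example configs shipped with the vLLM/LMCache disaggregation examples (verify them against your LMCache version):

```yaml
# lmcache-prefiller-config.yaml (sketch; key names may differ across LMCache versions)
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824  # 1 GiB staging buffer
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```

The decoder config (lmcache-decoder-config.yaml) is the same except that nixl_role is set to "receiver".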
Start the Prefill Instance (Producer) on GPU 0:
export PYTHONHASHSEED=123
export UCX_MEMTYPE_CACHE=n
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=lmcache-prefiller-config.yaml \
CUDA_VISIBLE_DEVICES=0 \
vllm serve Qwen/Qwen3-4B \
--port 7100 \
--disable-log-requests \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'
Output:
(EngineCore_DP0 pid=52898) 2026-02-15 11:51:51 NIXL INFO _api.py:363 Backend UCX was instantiated
(EngineCore_DP0 pid=52898) 2026-02-15 11:51:51 NIXL INFO _api.py:253 Initialized NIXL agent: deacb458-be89-4039-8cb1-83217ed3cd99
Start the Decode Instance (Consumer) on GPU 1:
export PYTHONHASHSEED=123
export UCX_MEMTYPE_CACHE=n
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=lmcache-decoder-config.yaml \
CUDA_VISIBLE_DEVICES=1 \
vllm serve Qwen/Qwen3-4B \
--port 7200 \
--disable-log-requests \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}'
Output:
(EngineCore_DP0 pid=56606) 2026-02-15 11:59:24 NIXL INFO _api.py:363 Backend UCX was instantiated
(EngineCore_DP0 pid=56606) 2026-02-15 11:59:24 NIXL INFO _api.py:253 Initialized NIXL agent: a7f6b7d6-641c-47f6-8937-c530fcc4e781
Start the Disaggregation Proxy Server:
The script lives in the vLLM repo at examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_proxy_server.py.
python3 disagg_proxy_server.py \
--host localhost \
--port 9100 \
--prefiller-host localhost \
--prefiller-port 7100 \
--decoder-host localhost \
--decoder-port 7200
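Once all three processes are up, requests go to the proxy, which forwards the prefill to port 7100 and streams the decode output from port 7200. A sketch of an OpenAI-style completion request (the /v1/completions path and payload shape are standard vLLM conventions; adjust if your proxy exposes a different route):

```python
import json
import urllib.request

# OpenAI-compatible completion payload sent to the disaggregation proxy.
payload = {
    "model": "Qwen/Qwen3-4B",
    "prompt": "Explain disaggregated prefill in one sentence.",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:9100/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once both instances and the proxy are running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
print(json.dumps(payload))
```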
The following benchmarks demonstrate the performance of the disaggregated setup.
Trial #1
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Maximum request concurrency: 100
Benchmark duration (s): 22.71
Total input tokens: 128000
Total generated tokens: 128000
Request throughput (req/s): 44.03
Output token throughput (tok/s): 5635.48
Peak output token throughput (tok/s): 8100.00
Peak concurrent requests: 200.00
Total token throughput (tok/s): 11270.96
---------------Time to First Token----------------
Mean TTFT (ms): 241.25
Median TTFT (ms): 111.67
P99 TTFT (ms): 1221.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.74
Median TPOT (ms): 16.27
P99 TPOT (ms): 16.82
---------------Inter-token Latency----------------
Mean ITL (ms): 15.74
Median ITL (ms): 12.71
P99 ITL (ms): 46.25
==================================================
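As a sanity check, the headline rates follow directly from the totals and the (rounded) benchmark duration:

```python
# Recompute Trial #1 throughput from the reported totals.
duration_s = 22.71
requests = 1000
input_tokens = 128_000
output_tokens = 128_000

req_per_s = requests / duration_s
total_tok_per_s = (input_tokens + output_tokens) / duration_s
print(round(req_per_s, 2))        # ~44.03 req/s, matching the report
print(round(total_tok_per_s, 2))  # ~11272.57 tok/s (report: 11270.96,
                                  # computed from the unrounded duration)
```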
Trial #2: Improved performance, likely due to KV cache reuse
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Maximum request concurrency: 100
Benchmark duration (s): 19.76
Total input tokens: 128000
Total generated tokens: 128000
Request throughput (req/s): 50.61
Output token throughput (tok/s): 6478.60
Peak output token throughput (tok/s): 6971.00
Peak concurrent requests: 200.00
Total token throughput (tok/s): 12957.20
---------------Time to First Token----------------
Mean TTFT (ms): 135.93
Median TTFT (ms): 65.67
P99 TTFT (ms): 815.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.12
Median TPOT (ms): 14.67
P99 TPOT (ms): 14.85
---------------Inter-token Latency----------------
Mean ITL (ms): 14.12
Median ITL (ms): 13.48
P99 ITL (ms): 20.82
==================================================
If LLM prefill and LLM decode are performed on different GPUs, KV cache must be physically transferred from the prefill GPU to the decode GPU. Without optimization, network latency and limited bandwidth will immediately become system bottlenecks, degrading performance.
To make this disaggregation useful, it is highly preferable (and often necessary) to use NVLink or RDMA-capable networks.
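As a back-of-the-envelope illustration of the transfer volume (the model dimensions below are placeholder assumptions for a GQA model of this size, not necessarily Qwen3-4B's actual configuration):

```python
# Per-token KV cache footprint: 2 (K and V) x layers x kv_heads x head_dim x bytes.
num_layers = 36       # assumption for illustration
num_kv_heads = 8      # assumption (grouped-query attention)
head_dim = 128        # assumption
bytes_per_elem = 2    # fp16/bf16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
prompt_mb = kv_bytes_per_token * 4096 / 1e6
print(kv_bytes_per_token)   # 147456 bytes (~144 KiB) per token
print(round(prompt_mb, 1))  # ~604.0 MB of KV cache for a 4096-token prompt
```

At the ~55 GB/s unidirectional P2P bandwidth measured above, moving that cache takes on the order of 10 ms; over a slow network the transfer would dominate TTFT, which is why NVLink and RDMA matter.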