Disaggregated Prefill/Decode Split Using LMCache (Dual NVIDIA RTX PRO 6000 Blackwell)

Hello,

This guide details the steps to set up a disaggregated serving system where the prefill and decode phases run on separate GPUs, using LMCache (https://lmcache.ai/) for efficient KV cache transfer and vLLM (https://github.com/vllm-project/vllm), a high-throughput and memory-efficient inference and serving engine for LLMs.

Hardware Setup: Dual NVIDIA RTX PRO 6000 Blackwell GPUs on a Gigabyte TRX50 AI TOP motherboard with an AMD Ryzen Threadripper 9960X

Install uv (a fast Python package installer):

curl -LsSf https://astral.sh/uv/install.sh | sh

Create a Virtual Environment:

sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate

Install vLLM:

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130

Install LMCache:

git clone https://github.com/LMCache/LMCache.git
cd LMCache/
uv pip install -r requirements/build.txt
uv pip install -e . --no-build-isolation

Install NIXL: The NIXL wheel bundles supported backends (like UCX).

uv pip install nixl[cu13]
uv pip install ucxx-cu13

GPU Connectivity Check (P2P)

Before running the split workload, verify p2p access between the GPUs.

python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"

Expected Output: True
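
A slightly more complete check enumerates all visible devices and tests peer access in both directions (a small sketch using PyTorch, which the vLLM install above already pulls in):

```shell
# List visible CUDA devices and test peer access in both directions.
python3 - <<'EOF'
import torch

n = torch.cuda.device_count()  # 0 if no GPU is visible
print(f"visible CUDA devices: {n}")
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'peer access OK' if ok else 'NO peer access'}")
EOF
```

On this machine both directions should report peer access OK.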

Detailed Bandwidth & Latency Test (CUDA Samples):

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
mkdir build && cd build

cmake .. \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
  -DCMAKE_C_COMPILER=/usr/bin/gcc \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-13.0 \
  -DCMAKE_CUDA_ARCHITECTURES=120

make -j$(nproc)

Run the test:

cd Samples/5_Domain_Specific/p2pBandwidthLatencyTest
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1490.98  43.76 
     1  43.55 1525.93 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 1488.14  55.46 
     1  55.86 1514.10 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1483.09  56.01 
     1  55.97 1503.08 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 1483.09 106.14 
     1 106.64 1494.46 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   2.07  14.31 
     1  14.33   2.07 

   CPU     0      1 
     0   2.14   5.59 
     1   5.63   2.09 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   2.07   0.44 
     1   0.51   2.07 

   CPU     0      1 
     0   2.10   1.44 
     1   1.43   2.08 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

In this disaggregated setup:

  • Prefill Instance (GPU 0): Processes the prompt and generates the initial KV cache; optimized for time to first token (TTFT).
  • Decode Instance (GPU 1): Receives the KV cache and generates subsequent tokens; optimized for time per output token (TPOT).
  • LMCache: Handles the transfer of the KV cache between the two instances via the network/interconnect.
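
The two LMCACHE_CONFIG_FILE files referenced below come from the vLLM disaggregated-prefill example. A sketch of the prefiller config, roughly matching the example shipped in the vLLM repo (field names may differ between LMCache releases, so check against your installed version):

```yaml
# lmcache-prefiller-config.yaml (sender side)
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824   # 1 GiB staging buffer
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```

The decoder config (lmcache-decoder-config.yaml) is identical except for nixl_role: "receiver".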

Start the Prefill Instance (Producer) on GPU 0:

export PYTHONHASHSEED=123
export UCX_MEMTYPE_CACHE=n

UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=lmcache-prefiller-config.yaml \
CUDA_VISIBLE_DEVICES=0 \
vllm serve Qwen/Qwen3-4B \
    --port 7100 \
    --disable-log-requests \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'

Output:

(EngineCore_DP0 pid=52898) 2026-02-15 11:51:51 NIXL INFO    _api.py:363 Backend UCX was instantiated
(EngineCore_DP0 pid=52898) 2026-02-15 11:51:51 NIXL INFO    _api.py:253 Initialized NIXL agent: deacb458-be89-4039-8cb1-83217ed3cd99

Start the Decode Instance (Consumer) on GPU 1:

export PYTHONHASHSEED=123
export UCX_MEMTYPE_CACHE=n

UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=lmcache-decoder-config.yaml \
CUDA_VISIBLE_DEVICES=1 \
vllm serve Qwen/Qwen3-4B \
    --port 7200 \
    --disable-log-requests \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}'

Output:

(EngineCore_DP0 pid=56606) 2026-02-15 11:59:24 NIXL INFO    _api.py:363 Backend UCX was instantiated
(EngineCore_DP0 pid=56606) 2026-02-15 11:59:24 NIXL INFO    _api.py:253 Initialized NIXL agent: a7f6b7d6-641c-47f6-8937-c530fcc4e781

Start the Disaggregation Proxy Server:
The script lives in the vLLM repo at examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_proxy_server.py

python3 disagg_proxy_server.py \
  --host localhost \
  --port 9100 \
  --prefiller-host localhost \
  --prefiller-port 7100 \
  --decoder-host localhost \
  --decoder-port 7200
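
With all three processes up, a quick smoke test can be sent through the proxy. The payload below is illustrative (model name and port taken from the commands above); the curl call is left commented since it needs the live servers:

```shell
# OpenAI-style completion payload; the proxy on port 9100 routes it
# through the prefiller and streams tokens from the decoder.
PAYLOAD='{"model":"Qwen/Qwen3-4B","prompt":"The capital of France is","max_tokens":32}'

# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Run once the prefiller, decoder, and proxy are all running:
# curl -s http://localhost:9100/v1/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```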

The following benchmarks demonstrate the performance of the disaggregated setup.

Trial #1

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             100       
Benchmark duration (s):                  22.71     
Total input tokens:                      128000    
Total generated tokens:                  128000    
Request throughput (req/s):              44.03     
Output token throughput (tok/s):         5635.48   
Peak output token throughput (tok/s):    8100.00   
Peak concurrent requests:                200.00    
Total token throughput (tok/s):          11270.96  
---------------Time to First Token----------------
Mean TTFT (ms):                          241.25    
Median TTFT (ms):                        111.67    
P99 TTFT (ms):                           1221.01   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.74     
Median TPOT (ms):                        16.27     
P99 TPOT (ms):                           16.82     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.74     
Median ITL (ms):                         12.71     
P99 ITL (ms):                            46.25     
==================================================

Trial #2: improved performance, likely due to KV cache reuse

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             100       
Benchmark duration (s):                  19.76     
Total input tokens:                      128000    
Total generated tokens:                  128000    
Request throughput (req/s):              50.61     
Output token throughput (tok/s):         6478.60   
Peak output token throughput (tok/s):    6971.00   
Peak concurrent requests:                200.00    
Total token throughput (tok/s):          12957.20  
---------------Time to First Token----------------
Mean TTFT (ms):                          135.93    
Median TTFT (ms):                        65.67     
P99 TTFT (ms):                           815.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.12     
Median TPOT (ms):                        14.67     
P99 TPOT (ms):                           14.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.12     
Median ITL (ms):                         13.48     
P99 ITL (ms):                            20.82     
==================================================

If prefill and decode run on different GPUs, the KV cache must be physically transferred from the prefill GPU to the decode GPU. Without optimization, network latency and limited bandwidth quickly become the system bottleneck and degrade performance.
To make this disaggregation worthwhile, it is highly preferable (and often necessary) to use NVLink or an RDMA-capable network for the transfer.
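
To put rough numbers on the transfer cost: a back-of-envelope sketch of the per-request KV volume, assuming Qwen3-4B uses 36 layers, 8 KV heads, and head_dim 128 with fp16 KV (check these against your checkpoint's config.json), and the ~55 GB/s unidirectional P2P bandwidth measured above:

```shell
# KV bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
python3 -c "
layers, kv_heads, head_dim, bytes_per = 36, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per
prompt = 128  # tokens, matching the benchmark above
total = prompt * per_token
print(f'{per_token} bytes/token ({per_token/1024:.0f} KiB)')
print(f'{prompt}-token prompt: {total/2**20:.1f} MiB of KV cache')
print(f'transfer at 55 GB/s: {total/55e9*1e3:.2f} ms')
"
```

At these sizes the transfer is well under a millisecond per request over P2P, which is why the measured TTFT stays low; over a slow network the same volume would dominate latency.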


Are you using the same cards in both machines, or are you using a card on each computer?

Have you had a chance to test NVFP4 models, and if so, have you noticed any improvements? If you have, could you share how these improvements impact overall performance and KV cache performance?