Disaggregated Prefill/Decode using SGLang and NIXL (Dual NVIDIA RTX PRO 6000 Blackwell)

Hello,

I wanted to share a working setup for running SGLang in disaggregated prefill/decode mode with the NIXL transfer backend. This is useful for maximizing throughput on multi-GPU systems such as H100 clusters.

Create and activate a virtual environment

uv venv .sglang --python 3.12
source .sglang/bin/activate

Install the main SGLang package

uv pip install sglang

Install PyTorch with CUDA 13.0 support

uv pip install --force-reinstall \
  torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
  --index-url https://download.pytorch.org/whl/cu130

Install SGLang kernel with CUDA 13.0 wheels

uv pip install --force-reinstall sgl-kernel \
  --index-url https://docs.sglang.ai/whl/cu130/

This ensures compatibility with CUDA 13.0. The --force-reinstall flag avoids conflicts with cached wheels.

Install NIXL with CUDA 13 support

uv pip install "nixl[cu13]"

Enables high-speed GPU-to-GPU transfer for disaggregation.

Install the SGLang router

uv pip install sglang-router

Launch Disaggregated Servers

Separates compute-heavy prefill from latency-sensitive decode stages across GPUs.
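Conceptually, the prefill worker processes the whole prompt once, then hands its state (the KV cache) off to a separate decode worker that generates tokens one at a time. A toy sketch of that handoff in plain Python (no GPUs or SGLang involved; the queue stands in for the NIXL transfer):

```python
# Toy sketch of the prefill/decode split: a queue stands in for the
# NIXL KV-cache transfer between the two workers.
from queue import Queue
from threading import Thread

handoff = Queue()  # stand-in for the GPU-to-GPU KV-cache transfer

def prefill_worker(prompt):
    # Compute-heavy pass over the whole prompt, producing per-token state.
    kv_cache = [f"kv({tok})" for tok in prompt.split()]
    handoff.put(kv_cache)

def decode_worker(results):
    # Latency-sensitive stage: receives the transferred state, then
    # would generate tokens one at a time from it.
    kv_cache = handoff.get()
    results.append(f"received {len(kv_cache)} cached tokens")

results = []
t1 = Thread(target=prefill_worker, args=("tell me about quantum computing",))
t2 = Thread(target=decode_worker, args=(results,))
t2.start(); t1.start()
t1.join(); t2.join()
print(results[0])  # received 5 cached tokens
```

In the real system the router plays the coordinating role, and the "handoff" is the KV cache moving between GPUs over the NIXL backend.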

Prefill worker (GPU 0)

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-transfer-backend nixl

Decode worker (GPU 1)

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-transfer-backend nixl

Start the router

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 \
  --port 8000

Test with OpenAI-Compatible Client

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Be concise and accurate."},
        {"role": "user", "content": "Tell me about quantum computing"}
    ],
    stream=True,
    max_tokens=2000,
)

# Print tokens as they arrive
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)

print()  # Final newline

Output from the decode instance:

[2026-02-22 12:50:43] Decode batch, #running-req: 1, #token: 780, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.11, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 820, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.04, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 860, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.42, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 900, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.35, #queue-req: 0,

I evaluated disaggregated prefill and decode using SGLang with NIXL on a dual RTX PRO 6000 Blackwell setup. In this configuration, one GPU handles the prefill phase and the other handles decode, enforcing a fixed 1:1 ratio (1 Prefill : 1 Decode).

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen3-4B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Output:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     1000      
Benchmark duration (s):                  114.41    
Total input tokens:                      512842    
Total input text tokens:                 512842    
Total generated tokens:                  510855    
Total generated tokens (retokenized):    510839    
Request throughput (req/s):              8.74      
Input token throughput (tok/s):          4482.67   
Output token throughput (tok/s):         4465.30   
Peak output token throughput (tok/s):    6353.00   
Peak concurrent requests:                117       
Total token throughput (tok/s):          8947.96   
Concurrency:                             93.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10753.30  
Median E2E Latency (ms):                 10648.14  
P90 E2E Latency (ms):                    17957.11  
P99 E2E Latency (ms):                    22622.34  
---------------Time to First Token----------------
Mean TTFT (ms):                          3289.73   
Median TTFT (ms):                        2468.30   
P99 TTFT (ms):                           8982.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.54     
Median TPOT (ms):                        15.17     
P99 TPOT (ms):                           17.46     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.64     
Median ITL (ms):                         15.18     
P95 ITL (ms):                            18.14     
P99 ITL (ms):                            19.31     
Max ITL (ms):                            25.36     
==================================================
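The headline figures above are internally consistent: dividing the raw token and request counts by the benchmark duration reproduces the reported throughput (small residuals come from the duration being rounded to two decimals):

```python
# Sanity-check the reported throughput against the raw counts above.
duration_s = 114.41
input_tokens = 512842
output_tokens = 510855
requests = 1000

req_per_s = requests / duration_s
total_tok_per_s = (input_tokens + output_tokens) / duration_s
print(f"{req_per_s:.2f} req/s, {total_tok_per_s:.2f} total tok/s")
```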

Comparison: Tensor Parallelism (TP=2)

I compared the above results with a standard TP=2 configuration using the same model and workload.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --port 8000 \
  --tp 2

Output:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     1000      
Benchmark duration (s):                  76.19     
Total input tokens:                      512842    
Total input text tokens:                 512842    
Total generated tokens:                  510855    
Total generated tokens (retokenized):    510833    
Request throughput (req/s):              13.13     
Input token throughput (tok/s):          6731.48   
Output token throughput (tok/s):         6705.40   
Peak output token throughput (tok/s):    8796.00   
Peak concurrent requests:                121       
Total token throughput (tok/s):          13436.89  
Concurrency:                             95.21     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7253.87   
Median E2E Latency (ms):                 7064.05   
P90 E2E Latency (ms):                    13515.42  
P99 E2E Latency (ms):                    14909.66  
---------------Time to First Token----------------
Mean TTFT (ms):                          125.31    
Median TTFT (ms):                        39.58     
P99 TTFT (ms):                           1350.39   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.25     
Median TPOT (ms):                        14.36     
P99 TPOT (ms):                           19.85     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.98     
Median ITL (ms):                         11.16     
P95 ITL (ms):                            27.90     
P99 ITL (ms):                            44.91     
Max ITL (ms):                            1285.39   
==================================================

On dual RTX PRO 6000 Blackwell GPUs, standard Tensor Parallelism (TP=2) delivers significantly better throughput and latency than a 1:1 disaggregated prefill/decode setup: roughly 1.5x the total token throughput, and a mean TTFT of about 125 ms versus 3.3 s.
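To quantify the gap, here are the key ratios computed directly from the two result tables above:

```python
# Key ratios between the TP=2 run and the disaggregated (PD) run,
# using the mean figures from the two benchmark tables above.
tp2 = {"total_tok_s": 13436.89, "mean_ttft_ms": 125.31, "mean_e2e_ms": 7253.87}
pd  = {"total_tok_s": 8947.96,  "mean_ttft_ms": 3289.73, "mean_e2e_ms": 10753.30}

tput_ratio = tp2["total_tok_s"] / pd["total_tok_s"]
ttft_ratio = pd["mean_ttft_ms"] / tp2["mean_ttft_ms"]
e2e_ratio = pd["mean_e2e_ms"] / tp2["mean_e2e_ms"]

print(f"throughput: TP=2 is {tput_ratio:.2f}x higher")
print(f"mean TTFT:  TP=2 is {ttft_ratio:.1f}x lower")
print(f"mean E2E:   TP=2 is {e2e_ratio:.2f}x lower")
```

The TTFT gap is the most dramatic: prefill results must be transferred to the decode GPU before the first token can stream out, which dominates time-to-first-token in the disaggregated run.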

The major bottleneck in the disaggregated configuration is KV cache transfer between GPUs. While NIXL enables this separation, moving KV cache data over PCIe introduces significant overhead compared to keeping data resident on the same GPU or sharing it efficiently via tensor parallelism. For multi-GPU systems, high-bandwidth interconnects such as NVLink are strongly recommended to minimize communication overhead between GPUs.
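A rough back-of-envelope calculation illustrates the cost of shipping a prefill KV cache across PCIe. The model dimensions and effective bandwidth below are illustrative assumptions, not measured values for Qwen3-4B or this machine:

```python
# Back-of-envelope cost of shipping the prefill KV cache to the decode GPU.
# All model dimensions and the PCIe bandwidth are ASSUMED illustrative values,
# not exact Qwen3-4B or RTX PRO 6000 numbers.
layers = 36          # transformer layers (assumed)
kv_heads = 8         # GQA key/value heads (assumed)
head_dim = 128       # per-head dimension (assumed)
dtype_bytes = 2      # fp16/bf16 KV cache
prompt_tokens = 1024 # matches --random-input-len in the benchmark above

# K and V tensors per layer, per token:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
transfer_mb = kv_bytes_per_token * prompt_tokens / 1e6

pcie_eff_gb_s = 25   # rough effective PCIe Gen5 x16 bandwidth (assumed)
transfer_ms = transfer_mb / (pcie_eff_gb_s * 1e3) * 1e3

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")
print(f"{transfer_mb:.0f} MB per {prompt_tokens}-token prompt, "
      f"~{transfer_ms:.1f} ms over PCIe")
```

Even a few milliseconds of transfer per request adds up at high concurrency, and real transfers also pay scheduling and pre-allocation overhead on the decode side, which is consistent with the much higher TTFT seen in the disaggregated benchmark.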