Disaggregated Prefill/Decode using SGLang and NIXL (Dual NVIDIA RTX PRO 6000 Blackwell)

Hello,

I wanted to share a working setup for running SGLang in disaggregated prefill/decode mode with the NIXL transfer backend. This is useful for maximizing throughput on multi-GPU systems such as H100 clusters.

Create and activate a virtual environment

uv venv .sglang --python 3.12
source .sglang/bin/activate

Install the main SGLang package

uv pip install sglang

Install PyTorch with CUDA 13.0 support

uv pip install --force-reinstall \
  torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
  --index-url https://download.pytorch.org/whl/cu130

Install SGLang kernel with CUDA 13.0 wheels

uv pip install --force-reinstall sgl-kernel \
  --index-url https://docs.sglang.ai/whl/cu130/

This ensures compatibility with CUDA 13.0. The --force-reinstall flag avoids conflicts with cached wheels.

Install NIXL with CUDA 13 support

uv pip install "nixl[cu13]"

Enables high-speed GPU-to-GPU transfer for disaggregation.

Install the SGLang router

uv pip install sglang-router

Launch Disaggregated Servers

Separates compute-heavy prefill from latency-sensitive decode stages across GPUs.
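Conceptually, the prefill worker processes the whole prompt once, then hands its state (the KV cache) off to a separate decode worker that generates tokens one at a time. A toy sketch of that handoff in plain Python (no GPUs or SGLang involved; the queue stands in for the NIXL transfer):

```python
# Toy sketch of the prefill/decode split: a queue stands in for the
# NIXL KV-cache transfer between the two workers.
from queue import Queue
from threading import Thread

handoff = Queue()  # stand-in for the GPU-to-GPU KV-cache transfer

def prefill_worker(prompt):
    # Compute-heavy pass over the whole prompt, producing per-token state.
    kv_cache = [f"kv({tok})" for tok in prompt.split()]
    handoff.put(kv_cache)

def decode_worker(results):
    # Latency-sensitive stage: receives the transferred state, then
    # would generate tokens one at a time from it.
    kv_cache = handoff.get()
    results.append(f"received {len(kv_cache)} cached tokens")

results = []
t1 = Thread(target=prefill_worker, args=("tell me about quantum computing",))
t2 = Thread(target=decode_worker, args=(results,))
t2.start(); t1.start()
t1.join(); t2.join()
print(results[0])  # received 5 cached tokens
```

In the real system the router plays the coordinating role, and the "handoff" is the KV cache moving between GPUs over the NIXL backend.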

Prefill worker (GPU 0)

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-transfer-backend nixl

Decode worker (GPU 1)

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-transfer-backend nixl

Start the router

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 \
  --port 8000

Test with OpenAI-Compatible Client

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Be concise and accurate."},
        {"role": "user", "content": "Tell me about quantum computing"}
    ],
    stream=True,
    max_tokens=2000,
)

# Print tokens as they arrive
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)

print()  # Final newline

Output from the decode instance:

[2026-02-22 12:50:43] Decode batch, #running-req: 1, #token: 780, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.11, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 820, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.04, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 860, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.42, #queue-req: 0, 
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 900, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.35, #queue-req: 0,

I evaluated disaggregated prefill and decode using SGLang with NIXL on a dual RTX PRO 6000 Blackwell setup. In this configuration, one GPU handles the prefill phase and the other handles decode, enforcing a fixed 1:1 ratio (1 Prefill : 1 Decode).

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen3-4B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Output:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     1000      
Benchmark duration (s):                  114.41    
Total input tokens:                      512842    
Total input text tokens:                 512842    
Total generated tokens:                  510855    
Total generated tokens (retokenized):    510839    
Request throughput (req/s):              8.74      
Input token throughput (tok/s):          4482.67   
Output token throughput (tok/s):         4465.30   
Peak output token throughput (tok/s):    6353.00   
Peak concurrent requests:                117       
Total token throughput (tok/s):          8947.96   
Concurrency:                             93.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10753.30  
Median E2E Latency (ms):                 10648.14  
P90 E2E Latency (ms):                    17957.11  
P99 E2E Latency (ms):                    22622.34  
---------------Time to First Token----------------
Mean TTFT (ms):                          3289.73   
Median TTFT (ms):                        2468.30   
P99 TTFT (ms):                           8982.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.54     
Median TPOT (ms):                        15.17     
P99 TPOT (ms):                           17.46     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.64     
Median ITL (ms):                         15.18     
P95 ITL (ms):                            18.14     
P99 ITL (ms):                            19.31     
Max ITL (ms):                            25.36     
==================================================
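The headline figures above are internally consistent: dividing the raw token and request counts by the benchmark duration reproduces the reported throughput (small residuals come from the duration being rounded to two decimals):

```python
# Sanity-check the reported throughput against the raw counts above.
duration_s = 114.41
input_tokens = 512842
output_tokens = 510855
requests = 1000

req_per_s = requests / duration_s
total_tok_per_s = (input_tokens + output_tokens) / duration_s
print(f"{req_per_s:.2f} req/s, {total_tok_per_s:.2f} total tok/s")
```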

Comparison: Tensor Parallelism (TP=2)

I compared the above results with a standard TP=2 configuration using the same model and workload.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --port 8000 \
  --tp 2

Output:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     1000      
Benchmark duration (s):                  76.19     
Total input tokens:                      512842    
Total input text tokens:                 512842    
Total generated tokens:                  510855    
Total generated tokens (retokenized):    510833    
Request throughput (req/s):              13.13     
Input token throughput (tok/s):          6731.48   
Output token throughput (tok/s):         6705.40   
Peak output token throughput (tok/s):    8796.00   
Peak concurrent requests:                121       
Total token throughput (tok/s):          13436.89  
Concurrency:                             95.21     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7253.87   
Median E2E Latency (ms):                 7064.05   
P90 E2E Latency (ms):                    13515.42  
P99 E2E Latency (ms):                    14909.66  
---------------Time to First Token----------------
Mean TTFT (ms):                          125.31    
Median TTFT (ms):                        39.58     
P99 TTFT (ms):                           1350.39   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.25     
Median TPOT (ms):                        14.36     
P99 TPOT (ms):                           19.85     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.98     
Median ITL (ms):                         11.16     
P95 ITL (ms):                            27.90     
P99 ITL (ms):                            44.91     
Max ITL (ms):                            1285.39   
==================================================

On dual RTX PRO 6000 Blackwell GPUs, standard Tensor Parallelism (TP=2) delivers significantly better throughput and latency than a 1:1 disaggregated prefill/decode setup: roughly 1.5x the total token throughput, and a mean TTFT of about 125 ms versus 3.3 s.
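To quantify the gap, here are the key ratios computed directly from the two result tables above:

```python
# Key ratios between the TP=2 run and the disaggregated (PD) run,
# using the mean figures from the two benchmark tables above.
tp2 = {"total_tok_s": 13436.89, "mean_ttft_ms": 125.31, "mean_e2e_ms": 7253.87}
pd  = {"total_tok_s": 8947.96,  "mean_ttft_ms": 3289.73, "mean_e2e_ms": 10753.30}

tput_ratio = tp2["total_tok_s"] / pd["total_tok_s"]
ttft_ratio = pd["mean_ttft_ms"] / tp2["mean_ttft_ms"]
e2e_ratio = pd["mean_e2e_ms"] / tp2["mean_e2e_ms"]

print(f"throughput: TP=2 is {tput_ratio:.2f}x higher")
print(f"mean TTFT:  TP=2 is {ttft_ratio:.1f}x lower")
print(f"mean E2E:   TP=2 is {e2e_ratio:.2f}x lower")
```

The TTFT gap is the most dramatic: prefill results must be transferred to the decode GPU before the first token can stream out, which dominates time-to-first-token in the disaggregated run.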

The major bottleneck in the disaggregated configuration is KV cache transfer between GPUs. While NIXL enables this separation, moving KV cache data over PCIe introduces significant overhead compared to keeping data resident on the same GPU or sharing it efficiently via tensor parallelism. For multi-GPU systems, high-bandwidth interconnects such as NVLink are strongly recommended to minimize communication overhead between GPUs.
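A rough back-of-envelope calculation illustrates the cost of shipping a prefill KV cache across PCIe. The model dimensions and effective bandwidth below are illustrative assumptions, not measured values for Qwen3-4B or this machine:

```python
# Back-of-envelope cost of shipping the prefill KV cache to the decode GPU.
# All model dimensions and the PCIe bandwidth are ASSUMED illustrative values,
# not exact Qwen3-4B or RTX PRO 6000 numbers.
layers = 36          # transformer layers (assumed)
kv_heads = 8         # GQA key/value heads (assumed)
head_dim = 128       # per-head dimension (assumed)
dtype_bytes = 2      # fp16/bf16 KV cache
prompt_tokens = 1024 # matches --random-input-len in the benchmark above

# K and V tensors per layer, per token:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
transfer_mb = kv_bytes_per_token * prompt_tokens / 1e6

pcie_eff_gb_s = 25   # rough effective PCIe Gen5 x16 bandwidth (assumed)
transfer_ms = transfer_mb / (pcie_eff_gb_s * 1e3) * 1e3

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")
print(f"{transfer_mb:.0f} MB per {prompt_tokens}-token prompt, "
      f"~{transfer_ms:.1f} ms over PCIe")
```

Even a few milliseconds of transfer per request adds up at high concurrency, and real transfers also pay scheduling and pre-allocation overhead on the decode side, which is consistent with the much higher TTFT seen in the disaggregated benchmark.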