Hello,
I wanted to share a working setup for running SGLang in disaggregated prefill/decode mode with the NIXL transfer backend. This is useful for maximizing throughput on multi-GPU systems such as H100 clusters.
Create and activate a virtual environment
uv venv .sglang --python 3.12
source .sglang/bin/activate
Install the main SGLang package
uv pip install sglang
Install PyTorch with CUDA 13.0 support
uv pip install --force-reinstall \
  torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
  --index-url https://download.pytorch.org/whl/cu130
Install SGLang kernel with CUDA 13.0 wheels
uv pip install --force-reinstall sgl-kernel \
  --index-url https://docs.sglang.ai/whl/cu130/
Ensures compatibility with CUDA 13.0. --force-reinstall avoids conflicts with cached wheels.
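To confirm the installed wheels really are CUDA 13.0 builds before launching anything, a quick check like the following can help (guarded with a try/except so it also runs in an environment where torch is missing):

```python
# Report the installed PyTorch build and the CUDA toolkit it was compiled against.
try:
    import torch
    print("torch version:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)   # e.g. "13.0" for the cu130 wheels
    print("GPU visible:", torch.cuda.is_available())
except ImportError:
    print("torch is not installed in this environment")
```

If `torch.version.cuda` reports something other than 13.0, the cached wheel from a different index was likely picked up, which is exactly what `--force-reinstall` is meant to avoid.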
Install NIXL with CUDA 13 support
uv pip install "nixl[cu13]"
Enables high-speed GPU-to-GPU transfer for disaggregation.
Install the SGLang router
uv pip install sglang-router
Launch Disaggregated Servers
Separates compute-heavy prefill from latency-sensitive decode stages across GPUs.
Prefill worker (GPU 0)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-transfer-backend nixl
Decode worker (GPU 1)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-transfer-backend nixl
Start the router
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 \
  --port 8000
Test with OpenAI-Compatible Client
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Be concise and accurate."},
        {"role": "user", "content": "Tell me about quantum computing"},
    ],
    stream=True,
    max_tokens=2000,
)

# Print tokens as they arrive
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)
print()  # Final newline
Sample output from the decode instance:
[2026-02-22 12:50:43] Decode batch, #running-req: 1, #token: 780, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.11, #queue-req: 0,
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 820, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.04, #queue-req: 0,
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 860, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.42, #queue-req: 0,
[2026-02-22 12:50:44] Decode batch, #running-req: 1, #token: 900, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 144.35, #queue-req: 0,
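The `gen throughput (token/s)` figures in these logs are handy for quick comparisons between configurations. A small sketch that pulls them out of captured decode-log text and averages them (it only assumes the `gen throughput (token/s): <float>` field shown above; the helper name is mine):

```python
import re

def average_gen_throughput(log_text: str) -> float:
    """Average the 'gen throughput (token/s)' values found in decode-log text."""
    values = [float(m) for m in
              re.findall(r"gen throughput \(token/s\): ([\d.]+)", log_text)]
    if not values:
        raise ValueError("no throughput entries found")
    return sum(values) / len(values)

sample = """
[2026-02-22 12:50:43] Decode batch, ... gen throughput (token/s): 144.11, #queue-req: 0,
[2026-02-22 12:50:44] Decode batch, ... gen throughput (token/s): 144.04, #queue-req: 0,
"""
print(f"average: {average_gen_throughput(sample):.2f} token/s")
```

Piping the decode worker's stderr into a file and running this over it gives a rough steady-state number without any extra instrumentation.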