Greetings to all,
Here is a guide on how to use the SGLang Router ("Router for Data Parallelism" in the SGLang docs), which distributes requests across multiple SGLang instances with its cache-aware load-balancing algorithm. In my test, generation throughput across the two instances was 50-78 tokens per second, and the router spread requests evenly, as its queue logs show:
2025-06-08 18:57:01 INFO sglang_router_rs::router: src/router.rs:220: Processed Queue: {"http://127.0.0.1:10000": 17, "http://127.0.0.1:20000": 18}
2025-06-08 18:57:01 INFO sglang_router_rs::router: src/router.rs:224: Running Queue: {"http://127.0.0.1:10000": 8, "http://127.0.0.1:20000": 8}
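Processed/Running queue counts like these only appear once enough requests are in flight at the same time. A minimal way to generate that kind of concurrent load is a thread pool; this is a sketch, and `send_fn` is a stand-in for whatever function posts one request to the router:

```python
from concurrent.futures import ThreadPoolExecutor

def flood(send_fn, n: int = 35, workers: int = 16):
    # Fire n prompts concurrently so both SGLang instances build up a
    # running queue, which is what the router's queue logs report.
    prompts = [f"prompt {i}" for i in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_fn, prompts))

# Stand-in send_fn for illustration; replace it with a real HTTP POST to the router.
results = flood(lambda prompt: len(prompt))
```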
Start the first SGLang server:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.3 --port 10000
Output:
[2025-06-08 16:29:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=48.38 GB, mem usage=1.40 GB.
[2025-06-08 16:29:08] KV Cache is allocated. #tokens: 98417, K size: 6.76 GB, V size: 6.76 GB
[2025-06-08 16:29:08] Memory pool end. avail mem=34.38 GB
[2025-06-08 16:29:08] Capture cuda graph begin. This can take up to several minutes. avail mem=33.52 GB
[2025-06-08 16:29:08] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=31.54 GB): 100%|████████████████████████████████████████| 23/23 [00:15<00:00, 1.46it/s]
[2025-06-08 16:29:24] Capture cuda graph end. Time elapsed: 15.83 s. mem usage=2.00 GB. avail mem=31.52 GB.
[2025-06-08 16:29:25] max_total_num_tokens=98417, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=31.38 GB
[2025-06-08 16:29:26] INFO: Started server process [166744]
[2025-06-08 16:29:26] INFO: Waiting for application startup.
[2025-06-08 16:29:26] INFO: Application startup complete.
[2025-06-08 16:29:26] INFO: Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
[2025-06-08 16:29:27] INFO: 127.0.0.1:57554 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 16:29:27] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 16:29:29] INFO: 127.0.0.1:57566 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 16:29:29] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO: 127.0.0.1:40816 - "GET /health HTTP/1.1" 200 OK
Start the second SGLang server:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.6 --port 20000
Output:
[2025-06-08 17:01:12] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=22.76 GB, mem usage=0.58 GB.
[2025-06-08 17:01:15] KV Cache is allocated. #tokens: 97665, K size: 6.71 GB, V size: 6.71 GB
[2025-06-08 17:01:15] Memory pool end. avail mem=9.20 GB
[2025-06-08 17:01:15] Capture cuda graph begin. This can take up to several minutes. avail mem=8.49 GB
[2025-06-08 17:01:15] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=6.63 GB): 100%|████████████████████████████████████████| 23/23 [00:15<00:00, 1.45it/s]
[2025-06-08 17:01:31] Capture cuda graph end. Time elapsed: 15.96 s. mem usage=1.91 GB. avail mem=6.58 GB.
[2025-06-08 17:01:32] max_total_num_tokens=97665, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=6.29 GB
[2025-06-08 17:01:33] INFO: Started server process [182330]
[2025-06-08 17:01:33] INFO: Waiting for application startup.
[2025-06-08 17:01:33] INFO: Application startup complete.
[2025-06-08 17:01:33] INFO: Uvicorn running on http://127.0.0.1:20000 (Press CTRL+C to quit)
[2025-06-08 17:01:34] INFO: 127.0.0.1:40530 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 17:01:34] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 17:01:36] INFO: 127.0.0.1:40534 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 17:01:36] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO: 127.0.0.1:47802 - "GET /health HTTP/1.1" 200 OK
[2025-06-08 17:48:18] Detected chat template content format: string
This gives you two SGLang servers, one on port 10000 and one on port 20000.
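Before wiring up the router, you can confirm both servers answer the same GET /health endpoint that shows up in the logs above. A small sketch using only the standard library:

```python
from urllib import error, request

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    # SGLang exposes GET /health; a 200 response means the server is up
    # (the '"GET /health HTTP/1.1" 200 OK' lines above are exactly this check).
    try:
        with request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except error.URLError:
        return False

for url in ("http://127.0.0.1:10000", "http://127.0.0.1:20000"):
    print(url, "healthy" if is_healthy(url) else "not reachable")
```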
GPU utilization:
Start the SGLang router:
python -m sglang_router.launch_router --worker-urls http://127.0.0.1:10000 http://127.0.0.1:20000
Output:
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:191: 🔧 Initializing Prometheus metrics on 127.0.0.1:29000
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:200: 🔧 Initializing router on 127.0.0.1:30000
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:201: 🔧 Initializing workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:202: 🔧 Policy Config: CacheAwareConfig { cache_threshold: 0.5, balance_abs_threshold: 32, balance_rel_threshold: 1.0001, eviction_interval_secs: 60, max_tree_size: 16777216, timeout_secs: 300, interval_secs: 10 }
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:203: 🔧 Max payload size: 4 MB
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:213: 🔧 Service discovery disabled
2025-06-08 17:48:01 INFO sglang_router_rs::router: src/router.rs:303: All workers are healthy
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:253: ✅ Serving router on 127.0.0.1:30000
2025-06-08 17:48:01 INFO sglang_router_rs::server: src/server.rs:254: ✅ Serving workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]
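The CacheAwareConfig line above shows the two knobs that decide when the router abandons cache-aware routing in favor of shortest-queue balancing: balance_abs_threshold (32) and balance_rel_threshold (1.0001). A rough sketch of that trigger as I understand it (simplified pseudologic, not the router's actual Rust code):

```python
def is_imbalanced(loads, abs_threshold: int = 32, rel_threshold: float = 1.0001) -> bool:
    # Compare the busiest and idlest workers' request counts; only when the
    # gap is large both absolutely and relatively does shortest-queue kick in.
    lo, hi = min(loads), max(loads)
    return (hi - lo) > abs_threshold and hi > lo * rel_threshold

print(is_imbalanced([17, 18]))  # balanced, like the Processed Queue log at the top
print(is_imbalanced([2, 60]))   # imbalanced: gap of 58 exceeds both thresholds
```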
Finally, send requests to the router's OpenAI-compatible API endpoint on port 30000. If your request load is low or infrequent, the router won't detect an imbalance and won't trigger shortest-queue load balancing.
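For example, a chat completion against the router using only the standard library (a sketch; the model name is assumed to match the Qwen/Qwen3-4B the workers were launched with):

```python
import json
from urllib import request

ROUTER_URL = "http://127.0.0.1:30000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # OpenAI-style chat payload; the router forwards it to one of the workers.
    return {
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def chat(prompt: str) -> str:
    req = request.Request(
        ROUTER_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# print(chat("Why load-balance across two GPUs?"))  # uncomment with the router running
```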
