Setting up multiple instances of the SGLang server using router on the NVIDIA Jetson AGX Orin 64GB dev kit

Greetings to all,

Here is a guide on how to use the SGLang Router (see "Router for Data Parallelism" in the SGLang docs), which distributes requests across multiple SGLang instances with its cache-aware load-balancing algorithm. Across the two instances, token generation throughput reached 50-78 tokens per second. The router's periodic queue logs show requests being spread evenly across both workers:

2025-06-08 18:57:01  INFO sglang_router_rs::router: src/router.rs:220: Processed Queue: {"http://127.0.0.1:10000": 17, "http://127.0.0.1:20000": 18}
2025-06-08 18:57:01  INFO sglang_router_rs::router: src/router.rs:224: Running Queue: {"http://127.0.0.1:10000": 8, "http://127.0.0.1:20000": 8}

Start the first SGLang server:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.3 --port 10000

Output

[2025-06-08 16:29:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=48.38 GB, mem usage=1.40 GB.
[2025-06-08 16:29:08] KV Cache is allocated. #tokens: 98417, K size: 6.76 GB, V size: 6.76 GB
[2025-06-08 16:29:08] Memory pool end. avail mem=34.38 GB
[2025-06-08 16:29:08] Capture cuda graph begin. This can take up to several minutes. avail mem=33.52 GB
[2025-06-08 16:29:08] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=31.54 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 23/23 [00:15<00:00,  1.46it/s]
[2025-06-08 16:29:24] Capture cuda graph end. Time elapsed: 15.83 s. mem usage=2.00 GB. avail mem=31.52 GB.
[2025-06-08 16:29:25] max_total_num_tokens=98417, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=31.38 GB
[2025-06-08 16:29:26] INFO:     Started server process [166744]
[2025-06-08 16:29:26] INFO:     Waiting for application startup.
[2025-06-08 16:29:26] INFO:     Application startup complete.
[2025-06-08 16:29:26] INFO:     Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
[2025-06-08 16:29:27] INFO:     127.0.0.1:57554 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 16:29:27] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 16:29:29] INFO:     127.0.0.1:57566 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 16:29:29] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO:     127.0.0.1:40816 - "GET /health HTTP/1.1" 200 OK
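Before launching the second instance, you can spot-check the first one directly on its native /generate endpoint (the same endpoint exercised in the log above). A minimal stdlib-only sketch; the prompt and token budget are just illustrative values:

```python
import json
import urllib.request


def build_generate_payload(prompt: str, max_new_tokens: int = 16) -> dict:
    """Build a request body for SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }


def generate(base_url: str, prompt: str) -> str:
    """POST a prompt to an SGLang instance and return the completion text."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]


# Example (requires the server started above to be running):
#   print(generate("http://127.0.0.1:10000", "The capital of France is"))
```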

Start the second SGLang server:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.6 --port 20000

Output

[2025-06-08 17:01:12] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=22.76 GB, mem usage=0.58 GB.
[2025-06-08 17:01:15] KV Cache is allocated. #tokens: 97665, K size: 6.71 GB, V size: 6.71 GB
[2025-06-08 17:01:15] Memory pool end. avail mem=9.20 GB
[2025-06-08 17:01:15] Capture cuda graph begin. This can take up to several minutes. avail mem=8.49 GB
[2025-06-08 17:01:15] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=6.63 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 23/23 [00:15<00:00,  1.45it/s]
[2025-06-08 17:01:31] Capture cuda graph end. Time elapsed: 15.96 s. mem usage=1.91 GB. avail mem=6.58 GB.
[2025-06-08 17:01:32] max_total_num_tokens=97665, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=6.29 GB
[2025-06-08 17:01:33] INFO:     Started server process [182330]
[2025-06-08 17:01:33] INFO:     Waiting for application startup.
[2025-06-08 17:01:33] INFO:     Application startup complete.
[2025-06-08 17:01:33] INFO:     Uvicorn running on http://127.0.0.1:20000 (Press CTRL+C to quit)
[2025-06-08 17:01:34] INFO:     127.0.0.1:40530 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 17:01:34] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 17:01:36] INFO:     127.0.0.1:40534 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 17:01:36] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO:     127.0.0.1:47802 - "GET /health HTTP/1.1" 200 OK
[2025-06-08 17:48:18] Detected chat template content format: string

This gives us two SGLang servers, one on port 10000 and one on port 20000.
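With both instances up, a quick sanity check is to query each one's /get_model_info endpoint (the same GET visible in both server logs). A small stdlib-only sketch; the response keys are an assumption based on typical SGLang output, hence the defensive `.get`:

```python
import json
import urllib.request


def info_url(port: int) -> str:
    """URL of SGLang's /get_model_info endpoint for a local instance."""
    return f"http://127.0.0.1:{port}/get_model_info"


def model_info(port: int) -> dict:
    """Fetch and decode the model-info JSON from one instance."""
    with urllib.request.urlopen(info_url(port)) as resp:
        return json.loads(resp.read())


# Example (with both servers running):
#   for port in (10000, 20000):
#       print(port, model_info(port).get("model_path"))
```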

GPU utilization:

Start the SGLang router:

python -m sglang_router.launch_router --worker-urls http://127.0.0.1:10000 http://127.0.0.1:20000

Output:

2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:191: 🚧 Initializing Prometheus metrics on 127.0.0.1:29000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:200: 🚧 Initializing router on 127.0.0.1:30000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:201: 🚧 Initializing workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:202: 🚧 Policy Config: CacheAwareConfig { cache_threshold: 0.5, balance_abs_threshold: 32, balance_rel_threshold: 1.0001, eviction_interval_secs: 60, max_tree_size: 16777216, timeout_secs: 300, interval_secs: 10 }
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:203: 🚧 Max payload size: 4 MB
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:213: 🚧 Service discovery disabled
2025-06-08 17:48:01  INFO sglang_router_rs::router: src/router.rs:303: All workers are healthy
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:253: βœ… Serving router on 127.0.0.1:30000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:254: βœ… Serving workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]

Finally, send requests to the OpenAI-compatible API endpoint on port 30000. Note that if your request load is low or infrequent, the router won't detect any imbalance and won't trigger shortest-queue load balancing.
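To actually see the queues fill on both workers, send a burst of concurrent requests through the router's /v1/chat/completions endpoint. A hedged stdlib-only sketch; the model name matches the --model-path used above, and the prompts and worker count are arbitrary:

```python
import json
import urllib.request

ROUTER = "http://127.0.0.1:30000"  # router address from the log above


def build_chat_payload(content: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 64,
    }


def chat(content: str) -> str:
    """Send one chat request through the router and return the reply text."""
    req = urllib.request.Request(
        f"{ROUTER}/v1/chat/completions",
        data=json.dumps(build_chat_payload(content)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# Fire many requests at once so the router sees real queue pressure:
#   from concurrent.futures import ThreadPoolExecutor
#   with ThreadPoolExecutor(max_workers=32) as pool:
#       replies = list(pool.map(chat, [f"Say the number {i}" for i in range(32)]))
```

With a burst like this, the "Processed Queue" and "Running Queue" counts in the router log at the top of this post should show requests landing on both workers.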
