Setting up multiple instances of the SGLang server using router on the NVIDIA Jetson AGX Orin 64GB dev kit

Greetings to all,

Here is a guide on how to use the SGLang Router (see "Router for Data Parallelism" in the SGLang docs), which distributes requests across multiple SGLang instances with its cache-aware load-balancing algorithm. Across the two instances, token generation throughput reached 50-78 tokens per second. The router's periodic queue logs show requests being spread evenly across both workers:

2025-06-08 18:57:01  INFO sglang_router_rs::router: src/router.rs:220: Processed Queue: {"http://127.0.0.1:10000": 17, "http://127.0.0.1:20000": 18}
2025-06-08 18:57:01  INFO sglang_router_rs::router: src/router.rs:224: Running Queue: {"http://127.0.0.1:10000": 8, "http://127.0.0.1:20000": 8}

Start the first SGLang server:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.3 --port 10000

Output

[2025-06-08 16:29:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=48.38 GB, mem usage=1.40 GB.
[2025-06-08 16:29:08] KV Cache is allocated. #tokens: 98417, K size: 6.76 GB, V size: 6.76 GB
[2025-06-08 16:29:08] Memory pool end. avail mem=34.38 GB
[2025-06-08 16:29:08] Capture cuda graph begin. This can take up to several minutes. avail mem=33.52 GB
[2025-06-08 16:29:08] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=31.54 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 23/23 [00:15<00:00,  1.46it/s]
[2025-06-08 16:29:24] Capture cuda graph end. Time elapsed: 15.83 s. mem usage=2.00 GB. avail mem=31.52 GB.
[2025-06-08 16:29:25] max_total_num_tokens=98417, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=31.38 GB
[2025-06-08 16:29:26] INFO:     Started server process [166744]
[2025-06-08 16:29:26] INFO:     Waiting for application startup.
[2025-06-08 16:29:26] INFO:     Application startup complete.
[2025-06-08 16:29:26] INFO:     Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
[2025-06-08 16:29:27] INFO:     127.0.0.1:57554 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 16:29:27] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 16:29:29] INFO:     127.0.0.1:57566 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 16:29:29] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO:     127.0.0.1:40816 - "GET /health HTTP/1.1" 200 OK
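Before launching the second instance, you can spot-check the first one directly on its native /generate endpoint (the same endpoint exercised in the log above). A minimal stdlib-only sketch; the prompt and token budget are just illustrative values:

```python
import json
import urllib.request


def build_generate_payload(prompt: str, max_new_tokens: int = 16) -> dict:
    """Build a request body for SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }


def generate(base_url: str, prompt: str) -> str:
    """POST a prompt to an SGLang instance and return the completion text."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]


# Example (requires the server started above to be running):
#   print(generate("http://127.0.0.1:10000", "The capital of France is"))
```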

Start the second SGLang server:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --mem-fraction-static 0.6 --port 20000

Output

[2025-06-08 17:01:12] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=22.76 GB, mem usage=0.58 GB.
[2025-06-08 17:01:15] KV Cache is allocated. #tokens: 97665, K size: 6.71 GB, V size: 6.71 GB
[2025-06-08 17:01:15] Memory pool end. avail mem=9.20 GB
[2025-06-08 17:01:15] Capture cuda graph begin. This can take up to several minutes. avail mem=8.49 GB
[2025-06-08 17:01:15] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (avail_mem=6.63 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 23/23 [00:15<00:00,  1.45it/s]
[2025-06-08 17:01:31] Capture cuda graph end. Time elapsed: 15.96 s. mem usage=1.91 GB. avail mem=6.58 GB.
[2025-06-08 17:01:32] max_total_num_tokens=97665, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=40960, available_gpu_mem=6.29 GB
[2025-06-08 17:01:33] INFO:     Started server process [182330]
[2025-06-08 17:01:33] INFO:     Waiting for application startup.
[2025-06-08 17:01:33] INFO:     Application startup complete.
[2025-06-08 17:01:33] INFO:     Uvicorn running on http://127.0.0.1:20000 (Press CTRL+C to quit)
[2025-06-08 17:01:34] INFO:     127.0.0.1:40530 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-08 17:01:34] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-08 17:01:36] INFO:     127.0.0.1:40534 - "POST /generate HTTP/1.1" 200 OK
[2025-06-08 17:01:36] The server is fired up and ready to roll!
[2025-06-08 17:48:01] INFO:     127.0.0.1:47802 - "GET /health HTTP/1.1" 200 OK
[2025-06-08 17:48:18] Detected chat template content format: string

This gives us two SGLang servers, one on port 10000 and one on port 20000.
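With both instances up, a quick sanity check is to query each one's /get_model_info endpoint (the same GET visible in both server logs). A small stdlib-only sketch; the response keys are an assumption based on typical SGLang output, hence the defensive `.get`:

```python
import json
import urllib.request


def info_url(port: int) -> str:
    """URL of SGLang's /get_model_info endpoint for a local instance."""
    return f"http://127.0.0.1:{port}/get_model_info"


def model_info(port: int) -> dict:
    """Fetch and decode the model-info JSON from one instance."""
    with urllib.request.urlopen(info_url(port)) as resp:
        return json.loads(resp.read())


# Example (with both servers running):
#   for port in (10000, 20000):
#       print(port, model_info(port).get("model_path"))
```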

GPU utilization:

Start the SGLang router:

python -m sglang_router.launch_router --worker-urls http://127.0.0.1:10000 http://127.0.0.1:20000

Output:

2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:191: 🚧 Initializing Prometheus metrics on 127.0.0.1:29000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:200: 🚧 Initializing router on 127.0.0.1:30000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:201: 🚧 Initializing workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:202: 🚧 Policy Config: CacheAwareConfig { cache_threshold: 0.5, balance_abs_threshold: 32, balance_rel_threshold: 1.0001, eviction_interval_secs: 60, max_tree_size: 16777216, timeout_secs: 300, interval_secs: 10 }
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:203: 🚧 Max payload size: 4 MB
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:213: 🚧 Service discovery disabled
2025-06-08 17:48:01  INFO sglang_router_rs::router: src/router.rs:303: All workers are healthy
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:253: βœ… Serving router on 127.0.0.1:30000
2025-06-08 17:48:01  INFO sglang_router_rs::server: src/server.rs:254: βœ… Serving workers on ["http://127.0.0.1:10000", "http://127.0.0.1:20000"]

Finally, send requests to the OpenAI-compatible API endpoint on port 30000. Note that if your request load is low or infrequent, the router won't detect any imbalance and won't trigger shortest-queue load balancing.
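To actually see the queues fill on both workers, send a burst of concurrent requests through the router's /v1/chat/completions endpoint. A hedged stdlib-only sketch; the model name matches the --model-path used above, and the prompts and worker count are arbitrary:

```python
import json
import urllib.request

ROUTER = "http://127.0.0.1:30000"  # router address from the log above


def build_chat_payload(content: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 64,
    }


def chat(content: str) -> str:
    """Send one chat request through the router and return the reply text."""
    req = urllib.request.Request(
        f"{ROUTER}/v1/chat/completions",
        data=json.dumps(build_chat_payload(content)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# Fire many requests at once so the router sees real queue pressure:
#   from concurrent.futures import ThreadPoolExecutor
#   with ThreadPoolExecutor(max_workers=32) as pool:
#       replies = list(pool.map(chat, [f"Say the number {i}" for i in range(32)]))
```

With a burst like this, the "Processed Queue" and "Running Queue" counts in the router log at the top of this post should show requests landing on both workers.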
