Hello,
This guide provides a step-by-step walkthrough for installing SGLang with CUDA 13.0 support, building the custom sgl-kernel, and launching the inference server.
1. Create Virtual Environment
uv venv .sglang --python 3.12
source .sglang/bin/activate
2. Install PyTorch (CUDA 13.0)
uv pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --force-reinstall --index-url https://download.pytorch.org/whl/cu130
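Once the install finishes, it is worth confirming that the CUDA 13.0 build is the one Python actually picks up. A minimal sketch (run it inside the .sglang venv); it prints a diagnostic instead of failing if torch is absent:

```python
def torch_cuda_summary() -> str:
    """Return a one-line summary of the installed torch/CUDA stack."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # Expect torch 2.9.1, CUDA 13.0, and available=True on a CUDA machine.
    return (f"torch {torch.__version__} cuda {torch.version.cuda} "
            f"available {torch.cuda.is_available()}")

print(torch_cuda_summary())
```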
3. Clone SGLang Repository
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e "python"
cd sgl-kernel
4. Install System Dependencies
sudo apt-get install -y libnuma-dev libibverbs-dev
uv pip install build wheel "cmake<4.0" ninja scikit-build-core
5. Set CUDA Environment Variables
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
6. Build Wheel
For DGX Spark:
TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 python -m build --wheel --no-isolation
MAX_JOBS=4 and CMAKE_BUILD_PARALLEL_LEVEL=1 keep peak RAM usage within safe limits during compilation.
For an x86 machine with 256 GB of RAM and an RTX 6000 Pro (Blackwell):
TORCH_CUDA_ARCH_LIST="12.0" MAX_JOBS=$(nproc) CMAKE_BUILD_PARALLEL_LEVEL=8 python -m build --wheel --no-isolation
If you have significant headroom, you can utilize more cores to speed up the compilation.
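The trade-off between the two commands above can be captured in a small rule-of-thumb helper. The ~2 GB-per-parallel-job estimate below is an assumption for illustration, not an official SGLang figure; tune it to your compiler's observed peak usage.

```python
# Pick a MAX_JOBS value that keeps estimated peak build RAM within budget.
# gb_per_job (~2 GB per parallel nvcc/g++ job) is a rough assumption.
def pick_max_jobs(ram_gb: float, cores: int, gb_per_job: float = 2.0) -> int:
    by_ram = int(ram_gb // gb_per_job)      # jobs the RAM budget allows
    return max(1, min(by_ram, cores))       # never exceed core count

print(pick_max_jobs(ram_gb=8, cores=20))    # -> 4 (RAM-bound)
print(pick_max_jobs(ram_gb=256, cores=32))  # -> 32 (core-bound)
```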
Expected output:
Successfully built sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl
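Note the cp310-abi3 tag even though the venv uses Python 3.12: abi3 is CPython's stable ABI, so a wheel built against 3.10 installs on any later 3.x. A short sketch of how the PEP 427 wheel filename decomposes:

```python
# Split a PEP 427 wheel filename into its compatibility tags.
def parse_wheel_name(filename: str) -> dict:
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {"name": name, "version": version,
            "python": python_tag, "abi": abi_tag, "platform": platform_tag}

tags = parse_wheel_name("sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl")
print(tags["python"], tags["abi"])  # cp310 abi3: stable ABI, fine on 3.12
```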
7. Install the Built Wheel
uv pip install --no-deps dist/sgl_kernel*.whl
8. Launch the SGLang Server
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--trust-remote-code \
--tp 1 \
--attention-backend flashinfer \
--tool-call-parser qwen3_coder \
--reasoning-parser nano_v3 \
--mem-fraction-static 0.7 \
--max-running-requests 8
Expected output (truncated):
b9ebcf14320097b02e63; skipping download.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.46it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.72it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.36it/s]
[2026-02-16 17:04:21] Load weight end. elapsed=11.30 s, type=NemotronHForCausalLM, dtype=torch.bfloat16, avail mem=67.92 GB, mem usage=26.34 GB.
[2026-02-16 17:04:21] Using KV cache dtype: torch.bfloat16
[2026-02-16 17:04:21] Mamba Cache is allocated. max_mamba_cache_size: 410, conv_state size: 0.32GB, ssm_state size: 18.46GB
[2026-02-16 17:04:21] KV Cache is allocated. #tokens: 3648783, K size: 10.44 GB, V size: 10.44 GB
[2026-02-16 17:04:21] Memory pool end. avail mem=28.25 GB
[2026-02-16 17:04:21] Capture cuda graph begin. This can take up to several minutes. avail mem=27.86 GB
[2026-02-16 17:04:21] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=27.72 GB): 100%|████████████████████████| 4/4 [03:32<00:00, 53.12s/it]
[2026-02-16 17:07:54] Capture cuda graph end. Time elapsed: 213.08 s. mem usage=0.17 GB. avail mem=27.69 GB.
[2026-02-16 17:07:55] max_total_num_tokens=3648783, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8, context_len=262144, available_gpu_mem=27.69 GB
[2026-02-16 17:07:55] INFO: Started server process [58255]
[2026-02-16 17:07:55] INFO: Waiting for application startup.
[2026-02-16 17:07:55] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0}
[2026-02-16 17:07:55] INFO: Application startup complete.
[2026-02-16 17:07:55] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-02-16 17:07:56] INFO: 127.0.0.1:59928 - "GET /model_info HTTP/1.1" 200 OK
[2026-02-16 17:07:59] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-02-16 17:07:59] INFO: 127.0.0.1:59942 - "POST /generate HTTP/1.1" 200 OK
[2026-02-16 17:07:59] The server is fired up and ready to roll!
The SGLang server is now installed and running.
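With the server up, requests can go to the native /generate endpoint seen in the logs. A minimal stdlib-only client sketch; the payload fields follow SGLang's native generate API, and the server from step 8 must be listening on 127.0.0.1:30000 before the commented-out send will work:

```python
import json
import urllib.request

def build_generate_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for SGLang's native /generate endpoint."""
    body = json.dumps({
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }).encode()
    return urllib.request.Request(
        "http://127.0.0.1:30000/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Send it only while the server is running:
# with urllib.request.urlopen(build_generate_request("Hello")) as resp:
#     print(json.loads(resp.read())["text"])
```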