Build SGLang from source on Blackwell Pro 6000 / DGX Spark

Hello,

This guide provides a step-by-step walkthrough for installing SGLang with CUDA 13.0 support, building the custom sgl-kernel, and launching the inference server.

1. Create Virtual Environment

uv venv .sglang --python 3.12
source .sglang/bin/activate

2. Install PyTorch (CUDA 13.0)

uv pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --force-reinstall --index-url https://download.pytorch.org/whl/cu130
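Before building anything against it, it is worth confirming that the cu130 build is the one that actually got installed. A minimal sanity check (it prints a fallback message if torch is not importable in the current environment):

```shell
# Confirm the CUDA 13.0 build of torch is active (expect something like "2.9.1+cu130 13.0 True")
python3 - <<'EOF'
try:
    import torch
    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
except ImportError:
    print("torch is not installed in this environment")
EOF
```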

3. Clone SGLang Repository

git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e "python"
cd sgl-kernel

4. Install System Dependencies

sudo apt-get install -y libnuma-dev libibverbs-dev
uv pip install build wheel "cmake<4.0" ninja scikit-build-core

5. Set CUDA Environment Variables

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

6. Build Wheel

For DGX Spark:

TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 python -m build --wheel --no-isolation

Set MAX_JOBS=4 and CMAKE_BUILD_PARALLEL_LEVEL=1 to keep RAM usage within safe limits during compilation.
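As a rough sizing heuristic (assumption: each parallel nvcc job can peak at around 8 GB of RAM while compiling sgl-kernel; on unified-memory machines like the DGX Spark the conservative values above are safer), MAX_JOBS can be derived from available memory:

```shell
# Derive MAX_JOBS from available RAM, assuming ~8 GB peak per compile job (rule of thumb)
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo 2>/dev/null)
if [ -z "$avail_kb" ]; then avail_kb=33554432; fi   # fall back to assuming 32 GB
jobs=$(( avail_kb / (8 * 1024 * 1024) ))            # /proc/meminfo values are in kB
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "MAX_JOBS=$jobs"
```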

For an x86 machine with 256 GB of RAM and a Blackwell 6000 Pro:

TORCH_CUDA_ARCH_LIST="12.0" MAX_JOBS=$(nproc) CMAKE_BUILD_PARALLEL_LEVEL=8 python -m build --wheel --no-isolation

If you have significant headroom, you can utilize more cores to speed up the compilation.

Expected output:

Successfully built sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl

7. Install the Built Wheel

uv pip install --no-deps dist/sgl_kernel*.whl

8. Launch the SGLang Server

python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --tp 1 \
  --attention-backend flashinfer \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3 \
  --mem-fraction-static 0.7 \
  --max-running-requests 8
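Once the server reports ready, a quick smoke test against the native /generate endpoint can be sketched as follows (the prompt text and sampling parameters here are arbitrary examples; the "text"/"sampling_params" fields follow SGLang's native generate API):

```shell
# Build and locally validate a /generate payload before sending it to the server
PAYLOAD='{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload is valid JSON"
# Requires the server launched above to be running:
# curl -s http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d "$PAYLOAD"
```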

Output

b9ebcf14320097b02e63; skipping download.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.46it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.72it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.36it/s]

[2026-02-16 17:04:21] Load weight end. elapsed=11.30 s, type=NemotronHForCausalLM, dtype=torch.bfloat16, avail mem=67.92 GB, mem usage=26.34 GB.
[2026-02-16 17:04:21] Using KV cache dtype: torch.bfloat16
[2026-02-16 17:04:21] Mamba Cache is allocated. max_mamba_cache_size: 410, conv_state size: 0.32GB, ssm_state size: 18.46GB
[2026-02-16 17:04:21] KV Cache is allocated. #tokens: 3648783, K size: 10.44 GB, V size: 10.44 GB
[2026-02-16 17:04:21] Memory pool end. avail mem=28.25 GB
[2026-02-16 17:04:21] Capture cuda graph begin. This can take up to several minutes. avail mem=27.86 GB
[2026-02-16 17:04:21] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=27.72 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [03:32<00:00, 53.12s/it]
[2026-02-16 17:07:54] Capture cuda graph end. Time elapsed: 213.08 s. mem usage=0.17 GB. avail mem=27.69 GB.
[2026-02-16 17:07:55] max_total_num_tokens=3648783, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8, context_len=262144, available_gpu_mem=27.69 GB
[2026-02-16 17:07:55] INFO: Started server process [58255]
[2026-02-16 17:07:55] INFO: Waiting for application startup.
[2026-02-16 17:07:55] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0}
[2026-02-16 17:07:55] INFO: Application startup complete.
[2026-02-16 17:07:55] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-02-16 17:07:56] INFO: 127.0.0.1:59928 - "GET /model_info HTTP/1.1" 200 OK
[2026-02-16 17:07:59] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-02-16 17:07:59] INFO: 127.0.0.1:59942 - "POST /generate HTTP/1.1" 200 OK
[2026-02-16 17:07:59] The server is fired up and ready to roll!

The SGLang server is now successfully installed and running.


Below are the performance and accuracy results of NVIDIA Nemotron-3-Nano-30B-A3B-NVFP4 running on a system equipped with two Blackwell Pro GPUs.

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  117.94
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  510855
Total generated tokens (retokenized):    442114
Request throughput (req/s):              8.48
Input token throughput (tok/s):          4348.25
Output token throughput (tok/s):         4331.40
Peak output token throughput (tok/s):    6041.00
Peak concurrent requests:                116
Total token throughput (tok/s):          8679.65
Concurrency:                             95.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11279.95
Median E2E Latency (ms):                 10775.53
P90 E2E Latency (ms):                    20573.32
P99 E2E Latency (ms):                    24219.32
---------------Time to First Token----------------
Mean TTFT (ms):                          486.52
Median TTFT (ms):                        42.17
P99 TTFT (ms):                           5488.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.87
Median TPOT (ms):                        21.17
P99 TPOT (ms):                           49.19
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.21
Median ITL (ms):                         15.48
P95 ITL (ms):                            49.18
P99 ITL (ms):                            79.59
Max ITL (ms):                            5965.74
==================================================
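As a sanity check, the headline figures above are internally consistent: each throughput is just the corresponding token count divided by the benchmark duration (the small differences come from rounding of the reported duration):

```shell
# Re-derive the reported throughputs from the raw counts in the table above
python3 - <<'EOF'
duration = 117.94          # benchmark duration (s)
total_input = 512842       # total input tokens
total_output = 510855      # total generated tokens
print(f"req/s        ~ {1000 / duration:.2f}")           # table reports 8.48
print(f"input tok/s  ~ {total_input / duration:.2f}")    # table reports 4348.25
print(f"output tok/s ~ {total_output / duration:.2f}")   # table reports 4331.40
EOF
```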

To evaluate model accuracy, I ran the MMLU benchmark using lm_eval with the local OpenAI-compatible completions endpoint.

lm_eval \
  --model local-completions \
  --tasks mmlu \
  --model_args "model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,max_lengths=16384" \
  --gen_kwargs '{"chat_template_kwargs":{"thinking":true}}' \
  --batch_size 256

Output:

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.7044|Β±  |0.0036|
| - humanities                          |      2|none  |      |acc   |↑  |0.6185|Β±  |0.0066|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.6032|Β±  |0.0438|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.7758|Β±  |0.0326|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.8676|Β±  |0.0238|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.8312|Β±  |0.0244|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.8512|Β±  |0.0325|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.7778|Β±  |0.0402|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.7239|Β±  |0.0351|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.7659|Β±  |0.0228|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.3542|Β±  |0.0160|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.7428|Β±  |0.0248|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.7994|Β±  |0.0223|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.5280|Β±  |0.0128|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.8480|Β±  |0.0275|
| - other                               |      2|none  |      |acc   |↑  |0.7564|Β±  |0.0074|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.6800|Β±  |0.0469|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.7585|Β±  |0.0263|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.6532|Β±  |0.0363|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.4600|Β±  |0.0501|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.6996|Β±  |0.0308|
|  - management                         |      1|none  |     0|acc   |↑  |0.7670|Β±  |0.0419|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.8889|Β±  |0.0206|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8500|Β±  |0.0359|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.8608|Β±  |0.0124|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.8333|Β±  |0.0213|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.5071|Β±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8382|Β±  |0.0224|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5663|Β±  |0.0386|
| - social sciences                     |      2|none  |      |acc   |↑  |0.8063|Β±  |0.0070|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.5702|Β±  |0.0466|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.8333|Β±  |0.0266|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9067|Β±  |0.0210|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.7872|Β±  |0.0208|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.8613|Β±  |0.0224|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9064|Β±  |0.0125|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8168|Β±  |0.0339|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.7467|Β±  |0.0176|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7455|Β±  |0.0417|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.6531|Β±  |0.0305|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.8557|Β±  |0.0248|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9200|Β±  |0.0273|
| - stem                                |      2|none  |      |acc   |↑  |0.6819|Β±  |0.0080|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.5700|Β±  |0.0498|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.6815|Β±  |0.0402|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.8421|Β±  |0.0297|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.8611|Β±  |0.0289|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.5800|Β±  |0.0496|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.6200|Β±  |0.0488|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.5200|Β±  |0.0502|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.5686|Β±  |0.0493|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.7800|Β±  |0.0416|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8170|Β±  |0.0253|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.7172|Β±  |0.0375|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.6058|Β±  |0.0252|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.8774|Β±  |0.0187|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.6749|Β±  |0.0330|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.7600|Β±  |0.0429|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.4741|Β±  |0.0304|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.6159|Β±  |0.0397|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.6852|Β±  |0.0317|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5536|Β±  |0.0472|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7044|Β±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6185|Β±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7564|Β±  |0.0074|
| - social sciences|      2|none  |      |acc   |↑  |0.8063|Β±  |0.0070|
| - stem           |      2|none  |      |acc   |↑  |0.6819|Β±  |0.0080|

The model achieves a solid ~70% MMLU accuracy using NVFP4 quantization.


BF16 is about 0.9 percentage points higher than NVFP4 in overall accuracy.

Here is the MMLU accuracy of the BF16 version, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16:


|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.7135|Β±  |0.0036|
| - humanities                          |      2|none  |      |acc   |↑  |0.6304|Β±  |0.0066|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.5794|Β±  |0.0442|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.7879|Β±  |0.0319|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.8725|Β±  |0.0234|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.8354|Β±  |0.0241|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.8760|Β±  |0.0301|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.7778|Β±  |0.0402|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.7669|Β±  |0.0332|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.7832|Β±  |0.0222|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.3709|Β±  |0.0162|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.7363|Β±  |0.0250|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.8241|Β±  |0.0212|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.5398|Β±  |0.0127|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.8480|Β±  |0.0275|
| - other                               |      2|none  |      |acc   |↑  |0.7576|Β±  |0.0073|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.6900|Β±  |0.0465|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.7736|Β±  |0.0258|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.6590|Β±  |0.0361|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.4300|Β±  |0.0498|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.7130|Β±  |0.0304|
|  - management                         |      1|none  |     0|acc   |↑  |0.8155|Β±  |0.0384|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.8974|Β±  |0.0199|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8200|Β±  |0.0386|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.8595|Β±  |0.0124|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.8268|Β±  |0.0217|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.5177|Β±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8235|Β±  |0.0232|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5542|Β±  |0.0387|
| - social sciences                     |      2|none  |      |acc   |↑  |0.8161|Β±  |0.0068|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.5439|Β±  |0.0469|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.8737|Β±  |0.0237|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9016|Β±  |0.0215|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.7923|Β±  |0.0206|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.8824|Β±  |0.0209|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9229|Β±  |0.0114|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8092|Β±  |0.0345|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.7598|Β±  |0.0173|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7182|Β±  |0.0431|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.6816|Β±  |0.0298|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.8458|Β±  |0.0255|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9300|Β±  |0.0256|
| - stem                                |      2|none  |      |acc   |↑  |0.6939|Β±  |0.0079|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.5600|Β±  |0.0499|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.7111|Β±  |0.0392|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.8487|Β±  |0.0292|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.8681|Β±  |0.0283|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6100|Β±  |0.0490|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.6400|Β±  |0.0482|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.5500|Β±  |0.0500|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.5588|Β±  |0.0494|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.7900|Β±  |0.0409|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8000|Β±  |0.0261|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.7034|Β±  |0.0381|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.6190|Β±  |0.0250|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.8871|Β±  |0.0180|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.7192|Β±  |0.0316|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.8200|Β±  |0.0386|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.4926|Β±  |0.0305|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.6093|Β±  |0.0398|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.6898|Β±  |0.0315|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5804|Β±  |0.0468|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7135|Β±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6304|Β±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7576|Β±  |0.0073|
| - social sciences|      2|none  |      |acc   |↑  |0.8161|Β±  |0.0068|
| - stem           |      2|none  |      |acc   |↑  |0.6939|Β±  |0.0079|
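Putting the two runs side by side (group accuracies copied from the tables above), the quantization cost works out to at most about 1.2 percentage points in any group, and about 0.9 points overall:

```shell
# Per-group MMLU gap: BF16 minus NVFP4, in percentage points
python3 - <<'EOF'
nvfp4 = {"mmlu": 0.7044, "humanities": 0.6185, "other": 0.7564,
         "social sciences": 0.8063, "stem": 0.6819}
bf16  = {"mmlu": 0.7135, "humanities": 0.6304, "other": 0.7576,
         "social sciences": 0.8161, "stem": 0.6939}
for group in nvfp4:
    print(f"{group:15s} {100 * (bf16[group] - nvfp4[group]):+.2f} pp")
EOF
```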

Below is the issue I encountered when running the SGLang benchmark for NVIDIA Nemotron-3-Nano-30B-A3B-NVFP4 on a DGX Spark. The SGLang server starts fine initially, but the issue occurs after running multiple requests.

[2026-02-17 17:10:44] INFO:     127.0.0.1:45064 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:44] Prefill batch, #new-seq: 1, #new-token: 689, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 628.58, cuda graph: False
[2026-02-17 17:10:45] INFO:     127.0.0.1:45074 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:45] Prefill batch, #new-seq: 1, #new-token: 119, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 515.77, cuda graph: False
[2026-02-17 17:10:46] INFO:     127.0.0.1:45076 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] Prefill batch, #new-seq: 1, #new-token: 147, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 183.49, cuda graph: False
[2026-02-17 17:10:46] INFO:     127.0.0.1:45084 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] INFO:     127.0.0.1:45086 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] Prefill batch, #new-seq: 1, #new-token: 352, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 366.23, cuda graph: False
[2026-02-17 17:10:47] Prefill batch, #new-seq: 1, #new-token: 652, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 100, #queue-req: 0, input throughput (token/s): 3028.01, cuda graph: False
[2026-02-17 17:10:49] INFO:     127.0.0.1:45088 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:49] Decode batch, #running-req: 99, #full token: 81467, full token usage: 0.01, mamba num: 198, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 654.73, #queue-req: 0
[2026-02-17 17:10:50] Prefill batch, #new-seq: 1, #new-token: 652, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 214.54, cuda graph: False
[2026-02-17 17:10:50] INFO:     127.0.0.1:45104 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:51] Prefill batch, #new-seq: 1, #new-token: 367, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 607.58, cuda graph: False
[2026-02-17 17:10:52] INFO:     127.0.0.1:45112 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:52] INFO:     127.0.0.1:45114 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:52] Prefill batch, #new-seq: 1, #new-token: 437, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 301.61, cuda graph: False
[2026-02-17 17:10:52] Prefill batch, #new-seq: 1, #new-token: 845, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 100, #queue-req: 0, input throughput (token/s): 3609.84, cuda graph: False
[2026-02-17 17:10:53] INFO:     127.0.0.1:50998 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:53] Prefill batch, #new-seq: 1, #new-token: 581, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 567.50, cuda graph: False
[2026-02-17 17:10:54] INFO:     127.0.0.1:51008 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:55] Prefill batch, #new-seq: 1, #new-token: 174, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 486.50, cuda graph: False
[2026-02-17 17:10:56] Decode batch, #running-req: 100, #full token: 84983, full token usage: 0.01, mamba num: 200, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 643.69, #queue-req: 0
[2026-02-17 17:10:57] INFO:     127.0.0.1:51012 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] Prefill batch, #new-seq: 1, #new-token: 193, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 80.44, cuda graph: False
[2026-02-17 17:10:57] INFO:     127.0.0.1:51022 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] INFO:     127.0.0.1:51038 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] Prefill batch, #new-seq: 2, #new-token: 643, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 98, #queue-req: 0, input throughput (token/s): 451.78, cuda graph: False
[2026-02-17 17:10:58] INFO:     127.0.0.1:51048 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:58] Prefill batch, #new-seq: 1, #new-token: 776, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 651.42, cuda graph: False
[2026-02-17 17:11:00] INFO:     127.0.0.1:51064 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:00] Prefill batch, #new-seq: 1, #new-token: 774, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 507.31, cuda graph: False
[2026-02-17 17:11:01] INFO:     127.0.0.1:51066 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:01] Prefill batch, #new-seq: 1, #new-token: 247, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 639.07, cuda graph: False
[2026-02-17 17:11:02] Decode batch, #running-req: 100, #full token: 85531, full token usage: 0.01, mamba num: 200, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 646.75, #queue-req: 0
[2026-02-17 17:11:04] INFO:     127.0.0.1:54114 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:04] Prefill batch, #new-seq: 1, #new-token: 290, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 79.40, cuda graph: False
[2026-02-17 17:11:05] INFO:     127.0.0.1:54118 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:05] INFO:     127.0.0.1:54134 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:05] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 3162, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/home/spark/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 1112, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 2472, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 148, in process_batch_result_prefill
    next_token_ids = next_token_ids.tolist()
                     ^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[2026-02-17 17:11:05] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
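The traceback itself suggests passing CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the reported stack frame actually points at the failing kernel (debugging-only; it slows inference noticeably). A sketch of the relaunch:

```shell
# Make kernel launches synchronous so the Python stack trace matches the failing kernel
export CUDA_LAUNCH_BLOCKING=1
# Then relaunch with the same flags as the original launch command, e.g.:
# python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 ...
```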

I think this version will still have problems with sglang[diffusion]. You will need to build sgl-kernel from source, which I am still trying.

This link provides more information:

this error is because sgl-kernel does not match your installed pytorch + cuda version


Thanks for the suggestion. After checking my installation steps, I found the problem: I did not specify the torch version, but sgl-kernel needs torch==2.9.1.

I was able to build sgl-kernel after following exactly the process specified in this post.

I only have the following to add:

If you need to use stable diffusion models, you will need to add the following after you have completed the installation:

uv pip install remote_pdb imageio diffusers addict cache_dit

I got some interesting results. My previous attempt always crashed at

[02-18 13:35:00] [DenoisingStage] started…

The new build made it through DenoisingStage and DecodingStage, but failed right after DecodingStage:

[02-18 13:35:33] [DecodingStage] finished in 2.0944 seconds
[02-18 13:35:33] Error executing request 33b9258f-0f2c-45f1-80e0-5e2a3ab2cfe1: Not Supported
Traceback (most recent call last):
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 243, in execute_forward
    self.do_mem_analysis(output_batch)
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 165, in do_mem_analysis
    current_platform.get_device_total_memory() / (1024**3) - peak_reserved_gb
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/platforms/cuda.py", line 64, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/platforms/cuda.py", line 437, in get_device_total_memory
    return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/third_party/pynvml.py", line 3782, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/third_party/pynvml.py", line 1305, in _nvmlCheckReturn
    raise NVMLError(ret)
sglang.multimodal_gen.third_party.pynvml.NVMLError_NotSupported: Not Supported
[02-18 13:35:33] Output saved to outputs/sample_0_33b9258f-0f2c-45f1-80e0-5e2a3ab2cfe1.jpg

The generated image was stored on the server but never sent back over HTTP.

Maybe I made some mistake with the HTTP request:

curl http://192.168.1.109:30000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Tongyi-MAI/Z-Image-Turbo",
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > otter.png

I will do more testing to trace the problem.

Does SGLang use FlashInfer as well? It could be that it hits the same issue as vLLM.

Yes

There are fixes missing in FlashInfer. See the bottom of 18203 for the list.

Also, work is being done in 18862 to make SGLang compatible with PyTorch 2.10.

Yes, I reported it to the NVIDIA Nemotron team (cc @calexiuk). Surprisingly, it works fine on the x86 machine.

LTX-2 claimed that they will soon release an open-weight model that rivals Seedance 2.0. SGLang and vLLM might be able to run it with Cache-DiT.

I hope all these problems get fixed by then.

It is funny that Hollywood and TV (Disney, Netflix, etc.) will be the first industries to face an existential crisis from AI. Programmers are still safe.

Disney threatening lawsuits is like blood in the water that excites the great white sharks in China.

It is my opinion that the DGX Spark is a dangerous machine: you can run it on a solar power setup and it is easy to carry, and it can generate video/image/audio capable of causing great harm through social media.


The error message still exists, but I can confirm that the SGLang server does send the generated image back over HTTP.

A working curl command is the following:

curl http://192.168.1.109:30000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > cat1.png) \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Tongyi-MAI/Z-Image-Turbo",
    "prompt": "cartoon cat eats fish",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }'
