Build SGLang from source on Blackwell Pro 6000 / DGX Spark

Hello,

This guide provides a step-by-step walkthrough for installing SGLang with CUDA 13.0 support, building the custom sgl-kernel, and launching the inference server.

1. Create Virtual Environment

uv venv .sglang --python 3.12
source .sglang/bin/activate

2. Install PyTorch (CUDA 13.0)

uv pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --force-reinstall --index-url https://download.pytorch.org/whl/cu130
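Before building anything against it, it is worth confirming that the cu130 build is the one that actually got installed. A minimal sanity check (it prints a fallback message if torch is not importable in the current environment):

```shell
# Confirm the CUDA 13.0 build of torch is active (expect something like "2.9.1+cu130 13.0 True")
python3 - <<'EOF'
try:
    import torch
    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
except ImportError:
    print("torch is not installed in this environment")
EOF
```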

3. Clone SGLang Repository

git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e "python"
cd sgl-kernel

4. Install System Dependencies

sudo apt-get install -y libnuma-dev libibverbs-dev
uv pip install build wheel "cmake<4.0" ninja scikit-build-core

5. Set CUDA Environment Variables

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

6. Build Wheel

For DGX Spark:

TORCH_CUDA_ARCH_LIST="12.1a" MAX_JOBS=4 CMAKE_BUILD_PARALLEL_LEVEL=1 python -m build --wheel --no-isolation

Set MAX_JOBS=4 and CMAKE_BUILD_PARALLEL_LEVEL=1 to keep RAM usage within safe limits during compilation.
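As a rough sizing heuristic (assumption: each parallel nvcc job can peak at around 8 GB of RAM while compiling sgl-kernel; on unified-memory machines like the DGX Spark the conservative values above are safer), MAX_JOBS can be derived from available memory:

```shell
# Derive MAX_JOBS from available RAM, assuming ~8 GB peak per compile job (rule of thumb)
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo 2>/dev/null)
if [ -z "$avail_kb" ]; then avail_kb=33554432; fi   # fall back to assuming 32 GB
jobs=$(( avail_kb / (8 * 1024 * 1024) ))            # /proc/meminfo values are in kB
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "MAX_JOBS=$jobs"
```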

For an x86 machine with 256 GB of RAM and a Blackwell 6000 Pro:

TORCH_CUDA_ARCH_LIST="12.0" MAX_JOBS=$(nproc) CMAKE_BUILD_PARALLEL_LEVEL=8 python -m build --wheel --no-isolation

If you have significant headroom, you can utilize more cores to speed up the compilation.

Expected output:

Successfully built sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl

7. Install the Built Wheel

uv pip install --no-deps dist/sgl_kernel*.whl

8. Launch the SGLang Server

python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --tp 1 \
  --attention-backend flashinfer \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3 \
  --mem-fraction-static 0.7 \
  --max-running-requests 8
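Once the server reports ready, a quick smoke test against the native /generate endpoint can be sketched as follows (the prompt text and sampling parameters here are arbitrary examples; the "text"/"sampling_params" fields follow SGLang's native generate API):

```shell
# Build and locally validate a /generate payload before sending it to the server
PAYLOAD='{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload is valid JSON"
# Requires the server launched above to be running:
# curl -s http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d "$PAYLOAD"
```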

Output

b9ebcf14320097b02e63; skipping download.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.46it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.72it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.36it/s]

[2026-02-16 17:04:21] Load weight end. elapsed=11.30 s, type=NemotronHForCausalLM, dtype=torch.bfloat16, avail mem=67.92 GB, mem usage=26.34 GB.
[2026-02-16 17:04:21] Using KV cache dtype: torch.bfloat16
[2026-02-16 17:04:21] Mamba Cache is allocated. max_mamba_cache_size: 410, conv_state size: 0.32GB, ssm_state size: 18.46GB
[2026-02-16 17:04:21] KV Cache is allocated. #tokens: 3648783, K size: 10.44 GB, V size: 10.44 GB
[2026-02-16 17:04:21] Memory pool end. avail mem=28.25 GB
[2026-02-16 17:04:21] Capture cuda graph begin. This can take up to several minutes. avail mem=27.86 GB
[2026-02-16 17:04:21] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=27.72 GB): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [03:32<00:00, 53.12s/it]
[2026-02-16 17:07:54] Capture cuda graph end. Time elapsed: 213.08 s. mem usage=0.17 GB. avail mem=27.69 GB.
[2026-02-16 17:07:55] max_total_num_tokens=3648783, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8, context_len=262144, available_gpu_mem=27.69 GB
[2026-02-16 17:07:55] INFO: Started server process [58255]
[2026-02-16 17:07:55] INFO: Waiting for application startup.
[2026-02-16 17:07:55] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0}
[2026-02-16 17:07:55] INFO: Application startup complete.
[2026-02-16 17:07:55] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-02-16 17:07:56] INFO: 127.0.0.1:59928 - "GET /model_info HTTP/1.1" 200 OK
[2026-02-16 17:07:59] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-02-16 17:07:59] INFO: 127.0.0.1:59942 - "POST /generate HTTP/1.1" 200 OK
[2026-02-16 17:07:59] The server is fired up and ready to roll!

The SGLang server is now successfully installed and running.


Below are the performance and accuracy results of NVIDIA Nemotron-3-Nano-30B-A3B-NVFP4 running on a system equipped with two Blackwell Pro GPUs.

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100

Output

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  117.94
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  510855
Total generated tokens (retokenized):    442114
Request throughput (req/s):              8.48
Input token throughput (tok/s):          4348.25
Output token throughput (tok/s):         4331.40
Peak output token throughput (tok/s):    6041.00
Peak concurrent requests:                116
Total token throughput (tok/s):          8679.65
Concurrency:                             95.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11279.95
Median E2E Latency (ms):                 10775.53
P90 E2E Latency (ms):                    20573.32
P99 E2E Latency (ms):                    24219.32
---------------Time to First Token----------------
Mean TTFT (ms):                          486.52
Median TTFT (ms):                        42.17
P99 TTFT (ms):                           5488.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.87
Median TPOT (ms):                        21.17
P99 TPOT (ms):                           49.19
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.21
Median ITL (ms):                         15.48
P95 ITL (ms):                            49.18
P99 ITL (ms):                            79.59
Max ITL (ms):                            5965.74
==================================================
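As a sanity check, the headline figures above are internally consistent: each throughput is just the corresponding token count divided by the benchmark duration (the small differences come from rounding of the reported duration):

```shell
# Re-derive the reported throughputs from the raw counts in the table above
python3 - <<'EOF'
duration = 117.94          # benchmark duration (s)
total_input = 512842       # total input tokens
total_output = 510855      # total generated tokens
print(f"req/s        ~ {1000 / duration:.2f}")           # table reports 8.48
print(f"input tok/s  ~ {total_input / duration:.2f}")    # table reports 4348.25
print(f"output tok/s ~ {total_output / duration:.2f}")   # table reports 4331.40
EOF
```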

To evaluate model accuracy, I ran the MMLU benchmark using lm_eval with the local OpenAI-compatible completions endpoint.

lm_eval \
  --model local-completions \
  --tasks mmlu \
  --model_args "model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,max_lengths=16384" \
  --gen_kwargs '{"chat_template_kwargs":{"thinking":true}}' \
  --batch_size 256

Output:

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.7044|Β±  |0.0036|
| - humanities                          |      2|none  |      |acc   |↑  |0.6185|Β±  |0.0066|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.6032|Β±  |0.0438|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.7758|Β±  |0.0326|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.8676|Β±  |0.0238|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.8312|Β±  |0.0244|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.8512|Β±  |0.0325|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.7778|Β±  |0.0402|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.7239|Β±  |0.0351|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.7659|Β±  |0.0228|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.3542|Β±  |0.0160|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.7428|Β±  |0.0248|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.7994|Β±  |0.0223|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.5280|Β±  |0.0128|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.8480|Β±  |0.0275|
| - other                               |      2|none  |      |acc   |↑  |0.7564|Β±  |0.0074|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.6800|Β±  |0.0469|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.7585|Β±  |0.0263|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.6532|Β±  |0.0363|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.4600|Β±  |0.0501|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.6996|Β±  |0.0308|
|  - management                         |      1|none  |     0|acc   |↑  |0.7670|Β±  |0.0419|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.8889|Β±  |0.0206|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8500|Β±  |0.0359|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.8608|Β±  |0.0124|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.8333|Β±  |0.0213|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.5071|Β±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8382|Β±  |0.0224|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5663|Β±  |0.0386|
| - social sciences                     |      2|none  |      |acc   |↑  |0.8063|Β±  |0.0070|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.5702|Β±  |0.0466|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.8333|Β±  |0.0266|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9067|Β±  |0.0210|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.7872|Β±  |0.0208|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.8613|Β±  |0.0224|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9064|Β±  |0.0125|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8168|Β±  |0.0339|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.7467|Β±  |0.0176|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7455|Β±  |0.0417|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.6531|Β±  |0.0305|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.8557|Β±  |0.0248|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9200|Β±  |0.0273|
| - stem                                |      2|none  |      |acc   |↑  |0.6819|Β±  |0.0080|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.5700|Β±  |0.0498|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.6815|Β±  |0.0402|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.8421|Β±  |0.0297|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.8611|Β±  |0.0289|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.5800|Β±  |0.0496|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.6200|Β±  |0.0488|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.5200|Β±  |0.0502|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.5686|Β±  |0.0493|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.7800|Β±  |0.0416|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8170|Β±  |0.0253|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.7172|Β±  |0.0375|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.6058|Β±  |0.0252|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.8774|Β±  |0.0187|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.6749|Β±  |0.0330|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.7600|Β±  |0.0429|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.4741|Β±  |0.0304|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.6159|Β±  |0.0397|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.6852|Β±  |0.0317|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5536|Β±  |0.0472|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7044|Β±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6185|Β±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7564|Β±  |0.0074|
| - social sciences|      2|none  |      |acc   |↑  |0.8063|Β±  |0.0070|
| - stem           |      2|none  |      |acc   |↑  |0.6819|Β±  |0.0080|

The model achieves a solid ~70% MMLU accuracy using NVFP4 quantization.


BF16 is about 0.9 percentage points higher than NVFP4 in overall accuracy.

Here is the MMLU accuracy of the BF16 version, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16:


|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.7135|Β±  |0.0036|
| - humanities                          |      2|none  |      |acc   |↑  |0.6304|Β±  |0.0066|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.5794|Β±  |0.0442|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.7879|Β±  |0.0319|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.8725|Β±  |0.0234|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.8354|Β±  |0.0241|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.8760|Β±  |0.0301|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.7778|Β±  |0.0402|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.7669|Β±  |0.0332|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.7832|Β±  |0.0222|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.3709|Β±  |0.0162|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.7363|Β±  |0.0250|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.8241|Β±  |0.0212|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.5398|Β±  |0.0127|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.8480|Β±  |0.0275|
| - other                               |      2|none  |      |acc   |↑  |0.7576|Β±  |0.0073|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.6900|Β±  |0.0465|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.7736|Β±  |0.0258|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.6590|Β±  |0.0361|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.4300|Β±  |0.0498|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.7130|Β±  |0.0304|
|  - management                         |      1|none  |     0|acc   |↑  |0.8155|Β±  |0.0384|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.8974|Β±  |0.0199|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8200|Β±  |0.0386|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.8595|Β±  |0.0124|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.8268|Β±  |0.0217|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.5177|Β±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8235|Β±  |0.0232|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5542|Β±  |0.0387|
| - social sciences                     |      2|none  |      |acc   |↑  |0.8161|Β±  |0.0068|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.5439|Β±  |0.0469|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.8737|Β±  |0.0237|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9016|Β±  |0.0215|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.7923|Β±  |0.0206|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.8824|Β±  |0.0209|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9229|Β±  |0.0114|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8092|Β±  |0.0345|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.7598|Β±  |0.0173|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7182|Β±  |0.0431|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.6816|Β±  |0.0298|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.8458|Β±  |0.0255|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9300|Β±  |0.0256|
| - stem                                |      2|none  |      |acc   |↑  |0.6939|Β±  |0.0079|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.5600|Β±  |0.0499|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.7111|Β±  |0.0392|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.8487|Β±  |0.0292|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.8681|Β±  |0.0283|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6100|Β±  |0.0490|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.6400|Β±  |0.0482|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.5500|Β±  |0.0500|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.5588|Β±  |0.0494|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.7900|Β±  |0.0409|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8000|Β±  |0.0261|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.7034|Β±  |0.0381|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.6190|Β±  |0.0250|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.8871|Β±  |0.0180|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.7192|Β±  |0.0316|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.8200|Β±  |0.0386|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.4926|Β±  |0.0305|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.6093|Β±  |0.0398|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.6898|Β±  |0.0315|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5804|Β±  |0.0468|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7135|Β±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6304|Β±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7576|Β±  |0.0073|
| - social sciences|      2|none  |      |acc   |↑  |0.8161|Β±  |0.0068|
| - stem           |      2|none  |      |acc   |↑  |0.6939|Β±  |0.0079|
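Putting the two runs side by side (group accuracies copied from the tables above), the quantization cost works out to at most about 1.2 percentage points in any group, and about 0.9 points overall:

```shell
# Per-group MMLU gap: BF16 minus NVFP4, in percentage points
python3 - <<'EOF'
nvfp4 = {"mmlu": 0.7044, "humanities": 0.6185, "other": 0.7564,
         "social sciences": 0.8063, "stem": 0.6819}
bf16  = {"mmlu": 0.7135, "humanities": 0.6304, "other": 0.7576,
         "social sciences": 0.8161, "stem": 0.6939}
for group in nvfp4:
    print(f"{group:15s} {100 * (bf16[group] - nvfp4[group]):+.2f} pp")
EOF
```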

Below is the issue I encountered when running the SGLang benchmark for NVIDIA Nemotron-3-Nano-30B-A3B-NVFP4 on a DGX Spark. The SGLang server starts fine initially, but the issue occurs after running multiple requests.

[2026-02-17 17:10:44] INFO:     127.0.0.1:45064 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:44] Prefill batch, #new-seq: 1, #new-token: 689, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 628.58, cuda graph: False
[2026-02-17 17:10:45] INFO:     127.0.0.1:45074 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:45] Prefill batch, #new-seq: 1, #new-token: 119, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 515.77, cuda graph: False
[2026-02-17 17:10:46] INFO:     127.0.0.1:45076 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] Prefill batch, #new-seq: 1, #new-token: 147, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 183.49, cuda graph: False
[2026-02-17 17:10:46] INFO:     127.0.0.1:45084 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] INFO:     127.0.0.1:45086 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:46] Prefill batch, #new-seq: 1, #new-token: 352, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 366.23, cuda graph: False
[2026-02-17 17:10:47] Prefill batch, #new-seq: 1, #new-token: 652, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 100, #queue-req: 0, input throughput (token/s): 3028.01, cuda graph: False
[2026-02-17 17:10:49] INFO:     127.0.0.1:45088 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:49] Decode batch, #running-req: 99, #full token: 81467, full token usage: 0.01, mamba num: 198, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 654.73, #queue-req: 0
[2026-02-17 17:10:50] Prefill batch, #new-seq: 1, #new-token: 652, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 214.54, cuda graph: False
[2026-02-17 17:10:50] INFO:     127.0.0.1:45104 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:51] Prefill batch, #new-seq: 1, #new-token: 367, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 607.58, cuda graph: False
[2026-02-17 17:10:52] INFO:     127.0.0.1:45112 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:52] INFO:     127.0.0.1:45114 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:52] Prefill batch, #new-seq: 1, #new-token: 437, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 301.61, cuda graph: False
[2026-02-17 17:10:52] Prefill batch, #new-seq: 1, #new-token: 845, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 100, #queue-req: 0, input throughput (token/s): 3609.84, cuda graph: False
[2026-02-17 17:10:53] INFO:     127.0.0.1:50998 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:53] Prefill batch, #new-seq: 1, #new-token: 581, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 567.50, cuda graph: False
[2026-02-17 17:10:54] INFO:     127.0.0.1:51008 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:55] Prefill batch, #new-seq: 1, #new-token: 174, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 486.50, cuda graph: False
[2026-02-17 17:10:56] Decode batch, #running-req: 100, #full token: 84983, full token usage: 0.01, mamba num: 200, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 643.69, #queue-req: 0
[2026-02-17 17:10:57] INFO:     127.0.0.1:51012 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] Prefill batch, #new-seq: 1, #new-token: 193, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 80.44, cuda graph: False
[2026-02-17 17:10:57] INFO:     127.0.0.1:51022 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] INFO:     127.0.0.1:51038 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:57] Prefill batch, #new-seq: 2, #new-token: 643, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 98, #queue-req: 0, input throughput (token/s): 451.78, cuda graph: False
[2026-02-17 17:10:58] INFO:     127.0.0.1:51048 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:10:58] Prefill batch, #new-seq: 1, #new-token: 776, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 651.42, cuda graph: False
[2026-02-17 17:11:00] INFO:     127.0.0.1:51064 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:00] Prefill batch, #new-seq: 1, #new-token: 774, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 507.31, cuda graph: False
[2026-02-17 17:11:01] INFO:     127.0.0.1:51066 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:01] Prefill batch, #new-seq: 1, #new-token: 247, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 639.07, cuda graph: False
[2026-02-17 17:11:02] Decode batch, #running-req: 100, #full token: 85531, full token usage: 0.01, mamba num: 200, mamba usage: 0.37, cuda graph: True, gen throughput (token/s): 646.75, #queue-req: 0
[2026-02-17 17:11:04] INFO:     127.0.0.1:54114 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:04] Prefill batch, #new-seq: 1, #new-token: 290, #cached-token: 0, full token usage: 0.01, mamba usage: 0.37, #running-req: 99, #queue-req: 0, input throughput (token/s): 79.40, cuda graph: False
[2026-02-17 17:11:05] INFO:     127.0.0.1:54118 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:05] INFO:     127.0.0.1:54134 - "POST /generate HTTP/1.1" 200 OK
[2026-02-17 17:11:05] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 3162, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/home/spark/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 1112, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 2472, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/home/spark/Projects/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 148, in process_batch_result_prefill
    next_token_ids = next_token_ids.tolist()
                     ^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[2026-02-17 17:11:05] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
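The traceback itself suggests passing CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the reported stack frame actually points at the failing kernel (debugging-only; it slows inference noticeably). A sketch of the relaunch:

```shell
# Make kernel launches synchronous so the Python stack trace matches the failing kernel
export CUDA_LAUNCH_BLOCKING=1
# Then relaunch with the same flags as the original launch command, e.g.:
# python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 ...
```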

I think this version will still have problems with sglang[diffusion]. You will need to build sgl-kernel from source, which I am still trying.

This link provides more information:

this error is because sgl-kernel does not match your installed pytorch + cuda version


Thanks for the suggestion. After checking my installation steps, I found the problem: I did not specify the torch version, but sgl-kernel needs torch==2.9.1.

I was able to build sgl-kernel after following exactly the process specified in this post.

I only have the following to add:

If you need to use stable diffusion models, you will need to add the following after you have completed the installation:

uv pip install remote_pdb imageio diffusers addict cache_dit

I got some interesting results. My previous attempt always crashed at

[02-18 13:35:00] [DenoisingStage] started…

The new build made it through DenoisingStage and DecodingStage, but failed right after DecodingStage:

[02-18 13:35:33] [DecodingStage] finished in 2.0944 seconds
[02-18 13:35:33] Error executing request 33b9258f-0f2c-45f1-80e0-5e2a3ab2cfe1: Not Supported
Traceback (most recent call last):
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 243, in execute_forward
    self.do_mem_analysis(output_batch)
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 165, in do_mem_analysis
    current_platform.get_device_total_memory() / (1024**3) - peak_reserved_gb
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/platforms/cuda.py", line 64, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/runtime/platforms/cuda.py", line 437, in get_device_total_memory
    return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/third_party/pynvml.py", line 3782, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
  File "/home/paul/sglang_test/sglang/python/sglang/multimodal_gen/third_party/pynvml.py", line 1305, in _nvmlCheckReturn
    raise NVMLError(ret)
sglang.multimodal_gen.third_party.pynvml.NVMLError_NotSupported: Not Supported
[02-18 13:35:33] Output saved to outputs/sample_0_33b9258f-0f2c-45f1-80e0-5e2a3ab2cfe1.jpg

The generated image was stored on the server but never sent back over HTTP.

Maybe I made some mistake with the HTTP request:

curl http://192.168.1.109:30000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Tongyi-MAI/Z-Image-Turbo",
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > otter.png

I will do more testing to trace the problem.

Does SGLang use FlashInfer as well? It could be that it hits the same issue as vLLM.

Yes

There are fixes missing in FlashInfer. See the bottom of 18203 for the list.

Also, work is being done in 18862 to make SGLang compatible with PyTorch 2.10.

Yes, I reported it to the NVIDIA Nemotron team (cc @calexiuk). Surprisingly, it works fine on the x86 machine.

LTX-2 claimed that they will soon release an open-weight model that rivals Seedance 2.0. SGLang and vLLM might be able to run it with Cache-DiT.

I hope all these problems get fixed by then.

It is funny that Hollywood and TV (Disney, Netflix, etc.) will be the first industries to face an existential crisis from AI. Programmers are still safe.

Disney threatening lawsuits is like blood in the water that excites the great white sharks in China.

It is my opinion that the DGX Spark is a dangerous machine: you can run it on a solar power setup and it is easy to carry, and it can generate video/image/audio capable of causing great harm through social media.


The error message still exists, but I can confirm that the SGLang server does send the generated image back over HTTP.

A working curl command is the following:

curl http://192.168.1.109:30000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > cat1.png) \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Tongyi-MAI/Z-Image-Turbo",
    "prompt": "cartoon cat eats fish",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }'
