Hello, thanks @johnny_nv, it was useful.
Installation Steps for sgl-kernel on Jetson Thor 🚀
Install CMake:
ARCH=$(uname -m)
wget https://cmake.org/files/v3.31/cmake-3.31.1-linux-${ARCH}.tar.gz
tar -xzf cmake-3.31.1-linux-${ARCH}.tar.gz
sudo mv cmake-3.31.1-linux-${ARCH} /opt/cmake
export PATH=/opt/cmake/bin:$PATH
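As a quick sanity check, the per-architecture download URL can be previewed before fetching (a small sketch; on Jetson Thor `uname -m` reports aarch64):

```shell
# Preview the download URL that the steps above will fetch.
# On Jetson Thor, uname -m reports aarch64, so the Linux aarch64 tarball is used.
ARCH=$(uname -m)
URL="https://cmake.org/files/v3.31/cmake-3.31.1-linux-${ARCH}.tar.gz"
echo "$URL"
```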
Install the required system dependencies:
sudo apt-get install -y libnuma-dev
Set the essential environment variables for the CUDA, Triton, and build processes:
export TORCH_CUDA_ARCH_LIST=11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CMAKE_BUILD_PARALLEL_LEVEL=1
export MAX_JOBS=4
export CPLUS_INCLUDE_PATH=/usr/local/cuda-13.0/targets/sbsa-linux/include/cccl
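Before kicking off the build, it can help to confirm these variables are actually set in the current shell (a minimal check; the variable list just mirrors the exports above):

```shell
# Print each build-related variable, flagging any that are unset.
for v in TORCH_CUDA_ARCH_LIST TRITON_PTXAS_PATH CMAKE_BUILD_PARALLEL_LEVEL MAX_JOBS CPLUS_INCLUDE_PATH; do
  printf '%s=%s\n' "$v" "$(printenv "$v" || echo '<unset>')"
done
```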
Navigate to the sgl-kernel source directory and use `uv build` to compile the library:
cd sgl-kernel
uv build --wheel --no-build-isolation . --out-dir "./wheels" \
  --config-settings=cmake.args="-G;Ninja" \
  --config-settings=cmake.define.TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" \
  --config-settings=cmake.define.CUDA_VERSION="13.0" \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_BF16=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP8=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP4=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FA3=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM90A=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM100A=1 \
  --config-settings=cmake.define.ENABLE_BELOW_SM90=OFF \
  --config-settings=cmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5
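Assuming the build succeeds, the wheel lands under ./wheels. A hedged install-and-import check (the exact wheel filename depends on the sgl-kernel version and platform tag, hence the glob):

```shell
# Install the freshly built wheel and confirm the extension module imports.
WHEEL=$(ls ./wheels/sgl_kernel-*.whl 2>/dev/null | head -n1)
if [ -n "$WHEEL" ]; then
  pip install --no-deps "$WHEEL"
  python3 -c "import sgl_kernel; print('sgl-kernel import OK')"
else
  echo "no wheel found in ./wheels (build may not have completed)"
fi
```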
Run the SGLang server:
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 --mem-fraction 0.6 --attention-backend triton
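Once the server reports it is ready, it exposes an OpenAI-compatible endpoint on the configured host/port. A minimal smoke-test request (the model name must match --model-path; the fallback message just keeps the command harmless if the server is not up yet):

```shell
# Build and locally validate a chat-completions payload, then send it to the server.
PAYLOAD='{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello from Jetson Thor"}],"max_tokens":32}'
echo "$PAYLOAD" | python3 -m json.tool >/dev/null && echo "payload is valid JSON"
curl -s --max-time 10 http://localhost:30000/v1/chat/completions \
  -H 'Content-Type: application/json' -d "$PAYLOAD" || echo "server not reachable yet"
```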
Output
[2025-10-25 21:23:47] Using default HuggingFace chat template with detected content format: string
[2025-10-25 21:23:51] INFO trace.py:48: opentelemetry package is not installed, tracing disabled
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-25 21:23:54] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-25 21:23:54] Init torch distributed ends. mem usage=0.00 GB
[2025-10-25 21:23:54] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-25 21:23:55] Load weight begin. avail mem=116.85 GB
[2025-10-25 21:23:56] Using model weights format ['*.safetensors']
Ignored error while writing commit hash to /home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main: [Errno 13] Permission denied: '/home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main'.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.33it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:04, 2.20s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 3.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 2.72s/it]
[2025-10-25 21:24:08] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=101.17 GB, mem usage=15.68 GB.
[2025-10-25 21:24:08] Using KV cache dtype: torch.bfloat16
[2025-10-25 21:24:10] KV Cache is allocated. #tokens: 445695, K size: 27.20 GB, V size: 27.20 GB
[2025-10-25 21:24:10] Memory pool end. avail mem=45.01 GB
[2025-10-25 21:24:10] Capture cuda graph begin. This can take up to several minutes. avail mem=44.83 GB
[2025-10-25 21:24:10] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=44.37 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-10-25 21:24:11] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=43.70 GB): 100%|███████████████████████████████████████| 36/36 [00:09<00:00, 3.75it/s]
[2025-10-25 21:24:20] Capture cuda graph end. Time elapsed: 10.15 s. mem usage=1.12 GB. avail mem=43.70 GB.
[2025-10-25 21:24:21] max_total_num_tokens=445695, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=43.61 GB
[2025-10-25 21:24:22] INFO: Started server process [150349]
[2025-10-25 21:24:22] INFO: Waiting for application startup.
[2025-10-25 21:24:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-25 21:24:22] INFO: Application startup complete.
[2025-10-25 21:24:22] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-10-25 21:24:23] INFO: 127.0.0.1:54984 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-25 21:24:23] Prefill batch [1], #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-25 21:24:27] INFO: 127.0.0.1:54986 - "POST /generate HTTP/1.1" 200 OK
[2025-10-25 21:24:27] The server is fired up and ready to roll!
[2025-10-25 21:25:01] INFO: 127.0.0.1:42242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-25 21:25:01] Prefill batch [10], #new-seq: 1, #new-token: 54, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-25 21:25:04] Decode batch [43], #running-req: 1, #token: 88, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.93, #queue-req: 0,
[2025-10-25 21:25:08] Decode batch [83], #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.03, #queue-req: 0,
[2025-10-25 21:25:11] Decode batch [123], #running-req: 1, #token: 168, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0,
[2025-10-25 21:25:15] Decode batch [163], #running-req: 1, #token: 208, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0,
[2025-10-25 21:25:19] Decode batch [203], #running-req: 1, #token: 248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0,
[2025-10-25 21:25:22] Decode batch [243], #running-req: 1, #token: 288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0,
[2025-10-25 21:25:26] Decode batch [283], #running-req: 1, #token: 328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0,
[2025-10-25 21:25:29] Decode batch [323], #running-req: 1, #token: 368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.95, #queue-req: 0,
[2025-10-25 21:25:33] Decode batch [363], #running-req: 1, #token: 408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0,
[2025-10-25 21:25:37] Decode batch [403], #running-req: 1, #token: 448, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0,
[2025-10-25 21:25:40] Decode batch [443], #running-req: 1, #token: 488, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0,